The On-Call Paging Aggregation Policy
Multiple alerts within 5 minutes are usually one incident. This aggregation policy keeps one incident from turning into 50 pages.
The rule
On-call paging aggregation is the discipline of grouping related alerts into single notifications instead of paging the on-call once per alert. A real incident often produces dozens of alerts within minutes; without aggregation, the on-call gets paged dozens of times for what is effectively one situation. Aggregation reduces cognitive load and preserves the on-call's attention.
What the aggregation rule looks like:
- Alerts within 5 minutes of each other for the same service.: Two alerts from the same service within 5 minutes are likely related to the same incident. The aggregation rule combines them into one notification.
- Aggregated into one notification.: The on-call gets one page, not many. The single notification represents the incident; the underlying alerts are accessible from the notification but do not produce separate pages.
- The aggregation has a list of contributing alerts.: Inside the notification, the on-call sees all the contributing alerts. The list includes timestamps, services, severities, and brief descriptions. The full picture is one click away.
- The on-call sees the full picture.: Without aggregation, the on-call has to mentally combine many separate notifications. With aggregation, the system has done the combining; the on-call sees the incident as one event.
- Time window is configurable.: The 5-minute window is a starting point. Some teams use 2-minute windows for tighter aggregation; others use 10-minute windows for looser aggregation. The right window depends on the team's incident patterns.
The aggregation rule is the foundation. Different scope and severity rules build on it.
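The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Alert` and `Notification` types and the `aggregate` function are hypothetical names, and the sketch assumes one particular reading of "within 5 minutes of each other", namely that an alert joins a service's open notification if it fired within the window of that notification's most recent alert.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical alert record; field names are assumptions.
    service: str
    severity: str
    description: str
    fired_at: datetime


@dataclass
class Notification:
    # One page; `alerts` is the list of contributing alerts shown to the on-call.
    service: str
    alerts: list = field(default_factory=list)


def aggregate(alerts, window=timedelta(minutes=5)):
    """Group alerts per service into notifications.

    An alert joins its service's open notification if it fired within
    `window` of that notification's latest alert; otherwise it opens a
    new notification (a new page).
    """
    open_by_service = {}
    notifications = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        current = open_by_service.get(alert.service)
        if current is not None and alert.fired_at - current.alerts[-1].fired_at <= window:
            current.alerts.append(alert)  # same incident, no new page
        else:
            current = Notification(service=alert.service, alerts=[alert])
            open_by_service[alert.service] = current
            notifications.append(current)  # new page
    return notifications


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    sample = [
        Alert("checkout", "warning", "p99 latency high", t0),
        Alert("checkout", "warning", "error rate high", t0 + timedelta(minutes=2)),
        Alert("checkout", "warning", "queue depth high", t0 + timedelta(minutes=4)),
        Alert("checkout", "warning", "p99 latency high", t0 + timedelta(minutes=30)),
    ]
    for page in aggregate(sample):
        print(page.service, len(page.alerts))
```

Passing `window=timedelta(minutes=2)` or `timedelta(minutes=10)` gives the tighter or looser grouping described above. Note that measuring the window from the latest alert lets a long-running incident keep extending one notification; measuring from the first alert instead would cap its duration.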
Scope
The scope rules determine which alerts can aggregate together. Per-service aggregation is the most common; cross-service aggregation is sometimes appropriate but requires care.
- Per service.: The default scope is per service. Two alerts on service A aggregate; an alert on service A and an alert on service B do not. The scope reflects the typical incident pattern: incidents are usually scoped to a service.
- Cross-service alerts do not aggregate.: An alert on service A and an alert on service B might be separate incidents. Aggregating them would hide one incident under another; the on-call sees only one notification but is dealing with two situations.
- They may be separate incidents.: Only investigation can tell the two cases apart. Sometimes cross-service alerts are related (a cascading failure); sometimes they are independent. The aggregation policy assumes independence by default.
- Severity-aware.: High-severity alerts always page, even if they overlap with a lower-severity aggregation. The on-call should always know about critical issues; the aggregation does not suppress them.
- High-severity always pages.: The carve-out for high-severity prevents aggregation from masking real incidents. A critical alert that fires during a low-severity aggregation produces its own page; the on-call is alerted explicitly.
The scope rules prevent aggregation from going too far. Without them, unrelated alerts get aggregated and critical alerts can be buried; with them, the aggregation matches the team's mental model.
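A hedged sketch of how the scope and severity rules might combine with the time window. The `Alert` type, the `build_pages` function, and the `"critical"` severity label are all assumptions made for illustration. The sketch also assumes that a high-severity alert gets its own page and is not added to the lower-severity group, which is one reasonable reading of the carve-out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed label for severities that bypass aggregation.
ALWAYS_PAGE = {"critical"}


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert record; field names are assumptions.
    service: str
    severity: str
    fired_at: datetime


def build_pages(alerts, window=timedelta(minutes=5)):
    """Return a list of pages; each page is a list of contributing alerts.

    Scope: alerts only aggregate with alerts from the same service.
    Severity: alerts in ALWAYS_PAGE each produce their own page.
    """
    pages = []
    open_group = {}  # service -> most recent aggregated page for that service
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        if alert.severity in ALWAYS_PAGE:
            pages.append([alert])  # carve-out: never folded into a group
            continue
        group = open_group.get(alert.service)
        if group is not None and alert.fired_at - group[-1].fired_at <= window:
            group.append(alert)  # same service, inside the window
        else:
            group = [alert]
            open_group[alert.service] = group
            pages.append(group)  # new page for this service
    return pages


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    sample = [
        Alert("service-a", "warning", t0),
        Alert("service-a", "critical", t0 + timedelta(minutes=1)),
        Alert("service-b", "warning", t0 + timedelta(minutes=1)),
        Alert("service-a", "warning", t0 + timedelta(minutes=2)),
    ]
    for page in build_pages(sample):
        print([(a.service, a.severity) for a in page])
```

In the sample, the two service A warnings share one page, the service A critical alert pages on its own, and the service B warning is not merged with service A. A team that sees many critical alerts per incident may want the carve-out to page only on the first critical alert per service; that variation is not shown here.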
Save
The benefits of aggregation are concrete: the on-call's attention is preserved, the incident is comprehensible, and triage starts faster.
- Typical incident: 10-30 alerts collapse to 1-3 notifications.: A real incident produces many related alerts. Without aggregation, each is a separate notification. With aggregation, the team gets a small number of notifications that represent the situation.
- Cognitive load drops.: The on-call is processing 1-3 notifications instead of 10-30. The mental overhead is much lower; the on-call's attention is on the incident, not on the alerts.
- Triage starts with the merged view.: The on-call opens the aggregated notification and sees the full picture immediately. Triage starts with the comprehensive view rather than building it from many separate alerts.
- Faster comprehension.: Understanding the incident is faster when the alerts are grouped. The pattern is visible; the dependencies are clear; the response is informed by the aggregate.
- Reduced alert fatigue.: Over time, the reduced page volume preserves the on-call's responsiveness. The pages they receive are more meaningful; the response is correspondingly more attentive.
An on-call paging aggregation policy is one of those alerting disciplines that pays off in proportion to alert volume. Nova AI Ops integrates with paging platforms, applies the aggregation rules, and produces the merged notifications that incident response actually needs.