Alert Grouping & Deduplication: A Practical Guide for SRE Teams
A single failure can produce 200 alerts. A single misconfigured grouping rule can collapse two real incidents into one and silently delay response by 30 minutes. Here is how to get grouping and dedup right.
Why Grouping and Deduplication Matter
Modern infrastructure produces alert storms. A single database outage can fire alerts from every dependent service (every API endpoint, every background job, every health check) within seconds. Without grouping, your on-call engineer wakes up to 200 separate pages and has to reconstruct what is actually happening. With aggressive grouping, the same engineer gets one page that says "payment-svc unavailable due to database failover," which is dramatically better.
The trade-off is that grouping done wrong silently combines two real incidents and delays response. The art is finding the boundary: aggressive enough to collapse storms from a single root cause, conservative enough to keep two unrelated incidents separated.
Deduplication: Same Alert, Many Times
Deduplication is the simpler problem: the same alert keeps firing because the underlying condition has not been fixed. Every modern alerting tool handles dedup natively by maintaining alert state and only firing the notification once per "active alert lifetime."
The implementation varies:
- Prometheus Alertmanager uses a fingerprint of the alert's labels. Two firings with identical labels are treated as one alert; the notification fires once when the alert opens and again when it resolves.
- PagerDuty uses a dedup_key field on incoming events (incident_key in the older Events API v1). Two events with the same key update an existing incident rather than creating a new one.
- Opsgenie uses an alias field for the same purpose.
The dedup logic works well as long as the labels (or dedup_key, or alias) are consistent across firings. The trap is dynamic content sneaking into the dedup key. A common bug: the alert labels include a timestamp or a request_id, which changes on every firing. The dedup logic then treats every firing as a new alert, and the on-call gets paged 100 times for the same condition.
Fix: dedup keys should contain only stable identifiers (service name, alert name, severity), never dynamic content (timestamps, request IDs, host IDs that change on auto-scaling).
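In Prometheus terms, dynamic values belong in annotations, which do not affect the fingerprint, never in labels, which do. A minimal sketch (the rule name, expression, and threshold are illustrative):
# Prometheus alerting rule: stable labels, dynamic annotations
groups:
  - name: dedup-hygiene
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{service="payment-svc", code=~"5.."}[5m])) > 0.05
        for: 5m
        labels:
          service: payment-svc   # stable: part of the dedup fingerprint
          severity: critical     # stable
          # BAD: a templated label like '{{ $value }}' changes the fingerprint on every evaluation
        annotations:
          # dynamic content is safe here; annotations do not affect alert identity
          summary: "5xx rate is {{ $value }} on payment-svc"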
Grouping Strategy 1: By Service
The simplest and most useful grouping. All alerts for a single service collapse into one notification, regardless of which specific check fired.
# Alertmanager grouping by service
route:
  receiver: 'team-oncall'   # assumes a receiver with this name is defined under receivers
  group_by: ['service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
The group_wait: 30s means Alertmanager waits 30 seconds after the first alert before sending the notification, batching any additional same-service alerts that arrive in that window. group_interval: 5m means that when new alerts join an already-firing group, a follow-up notification goes out at most every 5 minutes. repeat_interval: 4h re-sends the notification for a group that is still firing but unchanged, at most every 4 hours.
When to use: Default starting point for any team. Works well for service-oriented architectures where each service has a single owning team.
Trade-off: A single service can have legitimately separate problems happening at the same time (a slow query and an OOM in different parts of the codebase). Service-only grouping collapses them into one notification, which can mask the second problem.
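A common middle ground, as a sketch: add alertname to the grouping key so two distinct failure modes on the same service page separately, while each one still dedups cleanly.
# One notification per (service, failure mode) pair
route:
  receiver: 'team-oncall'   # assumes a receiver with this name exists
  group_by: ['service', 'alertname']
  group_wait: 30s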
Grouping Strategy 2: By Cluster or Region
Group by infrastructure scope: all alerts in cluster prod-us-east-1 collapse into one notification, regardless of which service. Useful when the failure mode is infrastructure-wide (a node failure, a network partition, a region outage).
route:
  receiver: 'team-oncall'             # assumes a receiver defined under receivers
  group_by: ['cluster', 'region']
  group_wait: 30s
  routes:
    - match:
        severity: critical
      group_by: ['service']           # critical alerts stay per-service
When to use: Multi-region or multi-cluster setups where infrastructure failures should produce one regional page rather than 50 per-service pages.
Trade-off: Hides per-service signal during region-wide events. Often paired with severity-based exceptions: critical alerts always page per-service, lower severity alerts collapse by region.
Grouping Strategy 3: By Time Window
Collapse all alerts that arrive within a defined time window into a single notification. The simplest implementation in Alertmanager is the group_wait setting; longer windows can be implemented with group_interval.
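A minimal sketch of pure time-window batching: an empty group_by puts every alert under the route into a single group, so the timers alone define the window.
# All alerts form one time-batched group
route:
  receiver: 'team-oncall'   # assumes a receiver with this name exists
  group_by: []              # no grouping labels: one group for everything
  group_wait: 1m            # collect everything that fires in the first minute
  group_interval: 10m       # then at most one follow-up notification per 10 minutes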
For more sophisticated time-windowed grouping, use a downstream tool like BigPanda, Moogsoft, or Nova AI Ops that supports rolling-window correlation: any alerts arriving within a 10-minute sliding window with overlapping labels are grouped together.
When to use: Deployment-related incidents where many services degrade simultaneously after a bad release. The time window collapses the storm into "something happened around 14:32 affecting these 7 services."
Trade-off: Two genuinely unrelated incidents that happen to fire within the window get combined. Rare but real.
Grouping Strategy 4: By Causal Topology (AI)
The most sophisticated grouping uses the actual causal relationships between services. If service A depends on service B, and both fire alerts within a short window, the system infers that A's alert is downstream of B's failure and groups them. The notification reads "B is down, causing A to fail," not "A and B both have problems."
This requires three things: an accurate service-dependency map (often built from distributed tracing data), a real-time correlation engine, and the ability to propose causal hypotheses with confidence scores. Tools that do this well in 2026 include Dynatrace Davis, BigPanda Open Box ML, and Nova AI Ops.
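The schema below is hypothetical, not any vendor's format, but a tracing-derived dependency map reduces to something like it. Given this input, an engine that sees alerts from both ends of an edge within one window can file the downstream alert under the upstream one.
# Hypothetical dependency map derived from tracing data
services:
  - name: postgres-primary
    depends_on: []
  - name: payment-svc
    depends_on: [postgres-primary]
  - name: checkout-api
    depends_on: [payment-svc]
# If postgres-primary and checkout-api alert in the same window, the engine
# walks these edges and groups checkout-api's alert under postgres-primary's.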
When to use: Microservices architectures with deep dependency chains where individual service alerts give you no insight into the actual failure cascade.
Trade-off: Requires investment in topology data (service mesh, tracing, change events) and tolerance for occasional miscorrelation when the topology is incomplete.
The Three Dedup Pitfalls That Hide Real Incidents
Three failure modes that look like good dedup and silently swallow real signals:
Pitfall 1: Auto-resolve drops the alert before anyone investigates. An alert fires, then the underlying condition oscillates back to healthy for a few seconds, and Alertmanager treats this as a resolution. The alert never reaches the on-call. Symptom: incidents that "fixed themselves" and recur an hour later when the condition fully degrades. Fix: increase the alert's for duration so the condition must persist before firing, and use keep_firing_for (Prometheus 2.42+) so the resolved state must also persist before the alert clears.
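A sketch of both halves of that fix in a Prometheus rule (the expression assumes a recording rule; keep_firing_for requires Prometheus 2.42 or later):
- alert: HighErrorRate
  expr: job:request_errors:rate5m > 0.05   # assumed recording rule
  for: 10m              # must hold continuously for 10 minutes before firing
  keep_firing_for: 5m   # stays firing for 5 minutes after the condition clears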
Pitfall 2: Dedup window collapses sequential incidents. Two genuinely separate incidents fire within a long dedup window and get treated as one. The second incident's context is lost. Symptom: post-mortems that read "we thought we'd fixed it but a second issue was masked." Fix: dedup windows should be measured in minutes, not hours.
Pitfall 3: group_by drops the most informative label. A poorly chosen group_by set omits the label that distinguishes two real failure modes. Symptom: the on-call thinks they are looking at one problem when they are actually looking at two. Fix: always include severity in group_by at minimum, and review grouping rules quarterly against actual incidents.
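A sketch of that fix: keep the distinguishing labels in the grouping key instead of collapsing them away.
# severity and alertname are the labels that most often separate two real failure modes
route:
  receiver: 'team-oncall'
  group_by: ['service', 'severity', 'alertname']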
A Worked Example: Database Outage
A primary Postgres instance fails over to a replica. The failover takes 12 seconds. During those 12 seconds, here is what fires:
- Database health check: 1 alert
- Connection pool errors from 8 dependent services: 8 alerts
- HTTP 5xx rate alerts from 6 user-facing services: 6 alerts
- Latency alerts from 12 internal services: 12 alerts
- Queue backup alerts from 4 background workers: 4 alerts
- Synthetic monitor alerts from 5 production probes: 5 alerts
Total: 36 alerts in 12 seconds from a single root cause.
With basic service grouping, these barely collapse at all: nearly every alert in this storm comes from a distinct service or probe, so the on-call still receives roughly 36 notifications. Per-service grouping helps when one service fires many checks; it does little for a storm spread across dozens of services.
With time-window grouping (10-minute window), they collapse into a single notification per affected region, but the notification text is just "many alerts firing" without root cause.
With causal-topology grouping, the system identifies the database failover as the root cause and produces one notification: "DB failover at 14:32:18 caused dependent service failures across 8 services. Failover completed at 14:32:30. Dependent services recovering." This is actionable. The on-call engineer knows what happened and what to investigate.
How AI Correlation Changes the Math
The traditional grouping strategies above (service, cluster, time-window) are all rule-based. They work well for the failure modes you anticipated and poorly for the ones you did not. AI correlation engines like Nova AI Ops learn the dependency topology and the historical co-occurrence patterns, and apply causal grouping that adapts as your architecture changes.
The result is typically a 90-95% reduction in pages reaching the on-call engineer, without losing visibility into the underlying alert volume (the full alert stream is still recorded for postmortem analysis). For a team paging 50 times a week, this is the difference between sustainable on-call and quiet attrition.
Try Nova to see how AI correlation handles your actual alert volume in production.