Alert Deduplication: Noise Reduction That Actually Works
Alert deduplication (dedup) is one of the highest-impact on-call improvements. The four-stage pipeline below can cut page volume by 60-80% once it is tuned.
Why dedup matters
Most teams skip dedup even though it is cheap to set up. The cost is small, and the page-volume reduction lands in the first week.
- Without dedup. One incident fires 50 separate pages from every monitor that can see it; on-call drowns in noise.
- With dedup. One page, clear signal, faster mental triage; MTTR drops because attention is not split.
- Volume impact. Properly tuned dedup cuts page volume by 60 to 80% without losing signal.
- Burnout lever. Pages-per-shift drops mechanically; the team's tolerance for real incidents grows.
Four-stage pipeline
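- 1. Event-time. The same alert firing again within N minutes counts as one event.
- 2. Label-based. Alerts with the same name and the same label set count as one event.
- 3. Similarity-based. Alerts that are not identical but score as similar (typically ML-based similarity scoring) merge into one event.
- 4. Dependency-aware. A firing parent alert suppresses the alerts of its children.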
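Stages 1 and 2 can be sketched in a few lines of Python. This is an illustration, not how Alertmanager or PagerDuty implement it: the `Alert` shape, `make_alert`, `dedup`, and the 300-second default are all assumptions made for the example, and the window is measured from the first alert of each event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    labels: tuple      # sorted (key, value) pairs, so equal label sets compare equal
    timestamp: float   # seconds since epoch


def make_alert(name, labels, timestamp):
    """Build an Alert with its labels normalized to a sorted tuple."""
    return Alert(name, tuple(sorted(labels.items())), timestamp)


def dedup(alerts, window_seconds=300):
    """Collapse alerts that share name + labels and arrive within the window.

    Returns a list of events; each event is the list of alerts merged into it.
    """
    open_events = {}   # fingerprint -> most recent event for that fingerprint
    events = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        fingerprint = (alert.name, alert.labels)   # stage 2: label-based key
        event = open_events.get(fingerprint)
        # Stage 1: merge only if within the window of the event's first alert.
        if event is not None and alert.timestamp - event[0].timestamp <= window_seconds:
            event.append(alert)
        else:
            event = [alert]
            open_events[fingerprint] = event
            events.append(event)
    return events


alerts = [
    make_alert("HighLatency", {"service": "api"}, 0),
    make_alert("HighLatency", {"service": "api"}, 120),   # merges with the first
    make_alert("HighLatency", {"service": "db"}, 130),    # different labels
    make_alert("HighLatency", {"service": "api"}, 900),   # outside the window
]
print([len(event) for event in dedup(alerts)])  # [2, 1, 1]
```

Measuring the window from the first alert keeps a long-running incident from staying merged indefinitely; a sliding window (measured from the most recent alert) is the other common choice and trades that property for fewer repeat pages.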
Tooling per stage
Each stage of the pipeline maps to specific tools. Pick by what your team already runs; do not buy a platform when Alertmanager covers 80% of the value.
- Stages 1-2 (event-time, label). Native in Alertmanager and PagerDuty; configured once, runs forever.
- Stage 3 (similarity). ML platforms (Moogsoft, BigPanda, native AIOps in Datadog or Splunk).
- Stage 4 (dependency). Service-graph-aware tools (e.g. PagerDuty's intelligent grouping, internal CMDB integrations).
- Open-source path. Karma plus Alertmanager covers stages 1-2 for free; stages 3-4 typically need vendor or build.
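For teams that take the build route for stages 3-4, the core logic might look roughly like the sketch below. Everything here is an assumption for illustration: token-overlap (Jaccard) similarity stands in for the ML scoring a vendor platform would use, `DEPENDS_ON` is a made-up service graph, and "parent" is taken to mean an upstream dependency whose failure explains the alerts of the services that depend on it.

```python
def jaccard(a, b):
    """Token-set overlap between two alert summaries, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_similar(summaries, threshold=0.5):
    """Stage 3 stand-in: greedily group summaries similar to a group's first member."""
    groups = []
    for summary in summaries:
        for group in groups:
            if jaccard(summary, group[0]) >= threshold:
                group.append(summary)
                break
        else:
            groups.append([summary])
    return groups


# Hypothetical service graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "postgres"],
    "payments": ["postgres"],
    "postgres": [],
}


def upstream(service, graph):
    """All transitive dependencies of a service (safe on cyclic graphs)."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def suppress_children(firing, graph):
    """Stage 4: keep only firing services that have no firing upstream dependency."""
    firing = set(firing)
    return sorted(s for s in firing if not (upstream(s, graph) & firing))


print(suppress_children(["checkout", "payments", "postgres"], DEPENDS_ON))  # ['postgres']
```

The similarity threshold is the knob that the false-merge audit ends up tuning: too low and unrelated alerts collapse together, too high and stage 3 merges nothing.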
False-merge audit
Aggressive dedup occasionally hides real signals. The audit is the safety net that preserves confidence in the pipeline.
- Weekly review. Owner reads the merged-event log; spot-checks 10 random merges for correctness.
- Verify distinct incidents. Two distinct services failing at once must not collapse into one event.
- Tune on findings. One false merge in a quarter is acceptable; a recurring pattern means a rule needs adjusting.
- Without audit. Dedup quietly hides real signals; the team only finds out during the postmortem.
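The weekly review can be partly scripted. The sketch below assumes a hypothetical merged-event log: a list of dicts with an `id` and an `alerts` list, where each alert carries a `service` field. Real tools export this in different shapes, so the field names are placeholders.

```python
import random


def sample_merges(merge_log, k=10, seed=None):
    """Pick up to k random merged events for a human to spot-check."""
    rng = random.Random(seed)
    return rng.sample(merge_log, min(k, len(merge_log)))


def cross_service_merges(merge_log):
    """Flag merged events whose member alerts span more than one service."""
    flagged = []
    for event in merge_log:
        services = {alert["service"] for alert in event["alerts"]}
        if len(services) > 1:
            flagged.append((event["id"], sorted(services)))
    return flagged


merge_log = [
    {"id": "evt-1", "alerts": [{"service": "api"}, {"service": "api"}]},
    {"id": "evt-2", "alerts": [{"service": "api"}, {"service": "db"}]},
    {"id": "evt-3", "alerts": [{"service": "db"}]},
]
print(cross_service_merges(merge_log))  # [('evt-2', ['api', 'db'])]
```

A cross-service merge is a candidate for review, not a verdict: dependency-aware suppression merges across services on purpose. The random sample still matters because it catches false merges that no automated rule anticipates.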
Antipatterns
- No dedup. Page flood.
- Aggressive ML dedup without audit. Hidden incidents.
- Different dedup per source. Inconsistent signal.
What to do this week
Three moves. (1) Turn on event-time and label-based dedup for your next on-call rotation. (2) Survey the team after one cycle and compare pages per shift against the previous one. (3) Iterate based on feedback; the discipline is the cadence.