Alert Deduplication Strategy
Same incident, multiple alerts. Dedupe early.
Why dedupe
One real failure produces dozens of pages: pod crash, container restart, service unavailable, dependent-service error-rate spike, downstream timeout. Same root cause, five separate pages.
Without dedupe the on-call clears 30 alerts to find the one that matters. With dedupe the on-call sees one grouped page with the dependent symptoms attached.
Dedupe is grouping plus inhibition plus correlation. All three layers matter.
Alertmanager grouping
Use group_by on shared labels: service, cluster, severity. A group_wait of 30 seconds collapses near-simultaneous events.
A group_interval of 5 minutes batches follow-ups so the same incident doesn't repage every minute.
Regex matchers in routes are fine, but match labels narrowly. Overly broad groups merge unrelated incidents.
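A minimal sketch of that routing config. Values are illustrative, the matchers syntax assumes Alertmanager 0.22+, and the service and receiver names are hypothetical:

```yaml
route:
  receiver: default-oncall
  # Alerts sharing these labels collapse into one notification.
  group_by: ['service', 'cluster', 'severity']
  # Hold the first page 30s so near-simultaneous alerts land together.
  group_wait: 30s
  # Batch follow-ups for an existing group every 5m, not on every firing.
  group_interval: 5m
  routes:
    # Match narrowly: one service, not a catch-all regex.
    - matchers:
        - service="checkout"
      receiver: checkout-oncall

receivers:
  - name: default-oncall
  - name: checkout-oncall
```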
Inhibition rules
When a cluster-wide alert fires, inhibit the per-pod alerts. The cluster issue subsumes them.
When upstream X is down, inhibit downstream Y's error-rate page. The downstream symptom is expected.
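Expressed as Alertmanager inhibit_rules. The alert names are hypothetical stand-ins for whatever the catalog actually defines:

```yaml
inhibit_rules:
  # Cluster-wide outage subsumes per-pod symptoms in the same cluster.
  - source_matchers:
      - alertname="ClusterDown"
    target_matchers:
      - alertname=~"KubePodCrashLooping|KubePodNotReady"
    # Only inhibit when source and target agree on this label.
    equal: ['cluster']
  # Upstream outage makes the downstream error-rate page expected noise.
  - source_matchers:
      - alertname="UpstreamServiceDown"
    target_matchers:
      - alertname="DownstreamHighErrorRate"
    equal: ['cluster']
```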
Document inhibition rules in the alert catalog. Hidden inhibitions confuse responders during partial failures.
Correlation across signals
AIOps layers (BigPanda, Moogsoft, Nova) cluster events by topology and time. Useful when alert sources are heterogeneous.
Topology comes from CMDB or service mesh. Without topology, correlation is just timestamp clustering.
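There is no standard schema for that topology feed; each tool defines its own ingestion format. As a purely illustrative sketch, with hypothetical names and keys, the dependency data a correlation engine consumes looks roughly like this:

```yaml
# Hypothetical service-dependency map exported from a CMDB or mesh.
services:
  - name: checkout
    depends_on: [payments, inventory]
  - name: payments
    depends_on: [postgres-primary]
# An event on postgres-primary can then be clustered with the
# payments and checkout symptom alerts that fire moments later.
```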
Validate clusters during incident review. A cluster that hides a real second incident is worse than no clustering at all.
Layer in this order
Start with Alertmanager group_by. It's free and handles 60% of dedupe needs.
Add inhibition for known parent-child relationships next. Document them.
Reach for AIOps correlation only when the catalog spans multiple alert sources and the topology is well mapped.