Alert Fatigue: The Math of Why It Happens and How to Reverse It
Alert fatigue is not a culture problem; it is an arithmetic problem. The math is simple; the discipline to fix it is rare.
Why volume always grows
Alerts are added during incidents (“we should have caught that”) and rarely retired (“might be useful someday”). Net flow is monotonically up.
Without active pruning, every team converges on the same state: too many alerts, most ignored.
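The arithmetic of that convergence can be sketched in a few lines. This is an illustrative model, not a measurement: the starting inventory, additions per quarter, and retirement rate are all assumed numbers.

```python
# Illustrative model of alert-inventory growth. All numbers are assumptions
# chosen to show the shape of the curve, not measured values.

def inventory_after(quarters, added_per_quarter=6, retired_per_quarter=0, start=40):
    """Alert count after N quarters of additions minus retirements."""
    count = start
    for _ in range(quarters):
        count = max(0, count + added_per_quarter - retired_per_quarter)
    return count

if __name__ == "__main__":
    print(inventory_after(8))                         # no pruning: inventory more than doubles
    print(inventory_after(8, retired_per_quarter=6))  # balanced flow: inventory is flat
```

The point of the model is the zero in `retired_per_quarter`: with any positive add rate and no retirement, the only question is how fast the inventory grows, not whether it grows.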
The three driving forces
- 1. Reactive addition. Every postmortem adds 1-3 alerts; nothing balances them.
- 2. False precision. Threshold alerts fire at noise.
- 3. Missing severity discipline. Pages and tickets get conflated; everything wakes someone.
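"False precision" is worth making concrete. The sketch below assumes a hypothetical CPU metric that is Gaussian noise around a steady mean; the distribution and the 80% threshold are illustrative assumptions, but the mechanism is general: a static threshold set near the tail of normal variation fires regularly on noise alone.

```python
# Hypothetical illustration of a static threshold firing on noise.
# The metric (CPU %, mean 60, sd 10) and threshold (80) are assumptions.
import random

random.seed(42)
samples = [random.gauss(60, 10) for _ in range(10_000)]  # simulated 1-min samples
threshold = 80

fires = sum(1 for s in samples if s > threshold)
print(f"{fires} noise-driven firings out of {len(samples)} samples")
```

At two standard deviations, roughly 2% of samples cross the threshold with nothing wrong, which at one-minute resolution is dozens of spurious firings per day from a single alert.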
The four-quarter reversal
- Q1: Inventory and tag every alert by signal strength.
- Q2: Retire the bottom 30% by signal.
- Q3: Convert remaining threshold alerts to burn-rate alerts.
- Q4: Establish the quarterly review as a permanent practice.
Each quarter is one focused pass. Cumulative effect: 60% fewer pages, same incident-detection capability.
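The Q3 conversion can be sketched as follows. This is a minimal, assumed setup: a 99.9% availability SLO, a multiwindow check, and a 14.4x threshold (the rate that exhausts a 30-day error budget in about two days); your SLO and windows will differ.

```python
# Sketch of converting a static error-rate threshold to a burn-rate page.
# The 99.9% SLO, the multiwindow structure, and 14.4x are assumptions.

SLO = 0.999
ERROR_BUDGET = 1 - SLO  # 0.1% of requests may fail

def burn_rate(errors, requests):
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(short_window, long_window, threshold=14.4):
    """Page only when both a short and a long window burn fast, so a
    transient spike (short only) or stale history (long only) stays quiet."""
    return (burn_rate(*short_window) >= threshold and
            burn_rate(*long_window) >= threshold)

# 2% errors sustained across both windows -> burn rate 20x -> page.
print(should_page((200, 10_000), (1_000, 50_000)))   # True
# 0.05% errors -> burn rate 0.5x -> within budget, no page.
print(should_page((5, 10_000), (25, 50_000)))        # False
```

Unlike a raw threshold, this alert is tied to a budget: it fires when the error rate threatens the SLO, and stays silent on blips that the budget can absorb.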
Measuring success
Pages per shift is the only metric that matters. Track it weekly; share with the team.
When the metric drops below 2 pages/shift sustainably, the discipline is working. Above 5, the program needs attention.
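A minimal sketch of tracking the metric, assuming page timestamps and shift boundaries are available as plain epoch-second values (the data format is an assumption; the 2/5 thresholds come from the text above).

```python
# Pages-per-shift tracker. Input format (epoch seconds, (start, end) shift
# tuples) is an assumption; the 2 and 5 thresholds are from the article.
from collections import Counter

def pages_per_shift(page_timestamps, shifts):
    """Count pages landing in each shift; shifts are half-open [start, end)."""
    counts = Counter()
    for ts in page_timestamps:
        for i, (start, end) in enumerate(shifts):
            if start <= ts < end:
                counts[i] += 1
                break
    return [counts[i] for i in range(len(shifts))]

def status(avg_pages):
    """Map the weekly average onto the article's health bands."""
    if avg_pages < 2:
        return "healthy"
    if avg_pages > 5:
        return "needs attention"
    return "watch"

shifts = [(0, 100), (100, 200), (200, 300)]
pages = [5, 50, 150, 151, 250, 251, 252]
per_shift = pages_per_shift(pages, shifts)
print(per_shift, status(sum(per_shift) / len(per_shift)))  # [2, 2, 3] watch
```

Keeping the computation this dumb is deliberate: the value comes from sharing the weekly number, not from sophistication in how it is derived.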
Antipatterns
- Adding alerts without retiring any. Net flow always up.
- One severity for everything. Pages stop being signal.
- Quarterly “alert review” that just rubber-stamps existing alerts. Plan to retire 20%, not 0%.
What to do this week
Three moves. (1) Apply this pattern to your noisiest alert. (2) Measure pages-per-shift before/after for one week. (3) Schedule the quarterly review so the discipline survives team turnover.