Alert Fatigue: The Real Cost
Alert fatigue costs more than 'tired engineers.' The real cost: missed real incidents, attrition, eroded trust.
What alert fatigue actually costs
The cost of alert fatigue is usually framed as “noise.” The real cost is detection loss, attrition, and a permanent shift in how the team treats every signal that lands.
- Lost detection. Engineers ignore the third PagerDuty page in an hour. The fourth page is the real outage and sits unacknowledged for 18 minutes while the on-call clears noise.
- Attrition. SREs name a noisy rotation as the top reason they leave. Replacing a senior on-call costs 6 to 9 months of ramp plus recruiter fees north of $25k.
- Eroded trust. Once a team learns alerts lie, they stop investigating any of them. The next genuine signal gets the same shrug as the noise.
- Customer cost. Slower MTTA on real incidents lands in the SLA penalty column at renewal. The fatigue tax shows up on the invoice.
How to measure it
Fatigue is measurable. The four signals below are leading indicators that move weeks before resignations or incident misses do.
- Pages per shift, ack distribution, auto-resolve ratio. If more than 60 percent of pages auto-resolve before human action, the threshold is wrong, not the system.
- Per-rotation survey. One question after each rotation: how many pages were actionable. Below 70 percent means the rotation is broken.
- Alertmanager silenced and inhibited counters. Steady growth in silences is a leading indicator of fatigue.
- Time-of-night skew. Pages clustered between 1am and 5am should be a small fraction of the total. Higher means the alerting is not respecting traffic curves.
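Three of these signals can be computed from a plain export of page records. The sketch below is a minimal illustration, not a definitive implementation. The Page schema and the fatigue_signals helper are hypothetical names, and the field layout will differ by paging tool. Ack distribution is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    """One page from the paging tool's export (hypothetical schema)."""
    alert: str
    fired_at: datetime      # assumed to be in the on-call engineer's local time
    auto_resolved: bool     # True if the page cleared before any human action


def fatigue_signals(pages: list[Page], shifts: int) -> dict[str, float]:
    """Compute pages per shift, auto-resolve ratio, and the 1am-5am share."""
    total = len(pages)
    if total == 0 or shifts <= 0:
        return {"pages_per_shift": 0.0, "auto_resolve_ratio": 0.0, "night_share": 0.0}
    auto = sum(1 for p in pages if p.auto_resolved)
    night = sum(1 for p in pages if 1 <= p.fired_at.hour < 5)
    return {
        "pages_per_shift": total / shifts,
        "auto_resolve_ratio": auto / total,
        "night_share": night / total,
    }


# Illustrative data only: three self-clearing CPU pages and one real incident.
pages = [
    Page("HighCPU", datetime(2024, 5, 1, 2, 10), True),
    Page("HighCPU", datetime(2024, 5, 1, 3, 40), True),
    Page("HighCPU", datetime(2024, 5, 1, 14, 5), True),
    Page("CheckoutErrors", datetime(2024, 5, 2, 11, 0), False),
]
signals = fatigue_signals(pages, shifts=2)
print(signals)
```

With this sample data the auto-resolve ratio is 0.75, above the 60 percent line, which points at the HighCPU threshold rather than the system.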
What actually reduces noise
Four interventions move the noise floor more than any tooling purchase. The order matters: delete first, restructure next, group third, and move informational signals off the pager last.
- Delete dead alerts. A dead alert is one that fired in the last 90 days but produced no ticket and no remediation. Default to deletion, not muting.
- Move flap-prone alerts to burn-rate SLO. A 2-hour 14.4x burn-rate page is worth waking someone for; a 30-second 90 percent CPU spike is not.
- Group on root cause. Alertmanager group_by plus a 5-minute group_wait collapses 40 simultaneous pod-down alerts into one page.
- Promote informational signals. Move capacity warnings and trend alerts to dashboards, never PagerDuty. The page channel stays scarce.
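The burn-rate arithmetic behind the second intervention is short enough to sketch. The example below assumes a 99.9 percent SLO and a 30-day budget period, neither of which is stated above. The function names are hypothetical. A burn rate of 1x means errors arrive at exactly the pace that would spend the whole budget by the end of the period.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target   # e.g. about 0.001 for a 99.9% SLO
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def should_page(error_ratio: float, slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when the measured burn rate reaches the threshold."""
    return burn_rate(error_ratio, slo_target) >= threshold


# A 14.4x burn held for 2 hours spends about 4% of a 30-day budget.
print(budget_consumed(14.4, window_hours=2))

# Assumed SLO of 99.9%: a 2% error ratio burns at roughly 20x, a 0.2% ratio at roughly 2x.
print(should_page(0.02, slo_target=0.999))    # pages
print(should_page(0.002, slo_target=0.999))   # does not page
```

The point of the comparison is that the burn-rate page is tied to budget actually spent, while a brief CPU spike carries no information about user-facing errors at all.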
The alert budget
The alert budget is the discipline that keeps the catalog clean over time. Without an explicit cap, every new alert is additive and the catalog rots.
- Hard cap. 2 pages per shift on average, 5 maximum in any 24-hour window. A new alert that pushes the team over forces an older one to retire.
- Alerts as code. Every new rule needs a runbook URL, an owner team, and an expiry date that forces re-justification.
- Quarterly review. Anything firing more than 10 times per month without a ticket is auto-flagged for deletion at review time.
- Visible budget. The on-call rotation channel publishes the budget consumed each week. Visibility forces tuning faster than any policy memo.
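The hard cap can be checked mechanically from page timestamps. The sketch below is one way to do it, assuming the 24-hour cap is read as a rolling window rather than a calendar day. The helper names are hypothetical, and the constants are taken from the cap described above.

```python
from datetime import datetime, timedelta

AVG_PAGES_PER_SHIFT = 2   # average cap from the text
MAX_PAGES_PER_24H = 5     # cap for any rolling 24-hour window


def max_in_rolling_window(times: list[datetime],
                          window: timedelta = timedelta(hours=24)) -> int:
    """Largest number of pages that fall inside any window of the given length."""
    ordered = sorted(times)
    best = 0
    start = 0
    for end, t in enumerate(ordered):
        # Advance the left edge until the span from start to t is shorter than the window.
        while t - ordered[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def budget_report(times: list[datetime], shifts: int) -> dict:
    """Summarise budget consumption for publishing to the rotation channel."""
    avg = len(times) / shifts if shifts else 0.0
    peak = max_in_rolling_window(times)
    return {
        "avg_per_shift": avg,
        "peak_24h": peak,
        "over_budget": avg > AVG_PAGES_PER_SHIFT or peak > MAX_PAGES_PER_24H,
    }


# Illustrative week: six pages in ten hours, then one more four days later.
base = datetime(2024, 5, 1, 8, 0)
times = [base + timedelta(hours=2 * i) for i in range(6)]
times.append(base + timedelta(days=4))
report = budget_report(times, shifts=7)
print(report)
```

In this sample the weekly average looks healthy at one page per shift, but the six-page burst breaks the 24-hour cap. That is why the budget tracks both numbers.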
Where to start this week
The discipline ships in four concrete steps. None require permission outside the on-call team.
- 30-day pull. Pull the last 30 days of pages. Group by alert name, count occurrences, and join against the incident tracker. Rank alerts by how often a page led to an incident; the bottom half is your delete pile.
- Pager budget on the roadmap. Add an explicit pager budget to the SRE roadmap. Make it a leadership commitment, not a wishlist item.
- Skip vendor noise-reduction tools. Deduping bad rules just buries them deeper. Clean the rules first.
- Stamp expiry dates. Walk the catalog and add an expiry date to every alert. Anything missing one becomes the first review pile.
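The 30-day pull can be sketched in a few lines once the pages and the incident tracker are exported. The example below is an illustration under stated assumptions. The record layout (dicts with "id" and "alert" keys) and the delete_candidates helper are hypothetical. It reads "bottom half" as the half of alerts whose pages least often matched an incident, which is one interpretation, not the only one.

```python
from collections import Counter


def delete_candidates(pages: list[dict],
                      incident_page_ids: set[str]) -> list[tuple[str, int, float]]:
    """Rank alerts by the share of their pages that matched an incident.

    Returns the bottom half as (alert name, page count, incident share),
    least useful first.
    """
    fired = Counter(p["alert"] for p in pages)
    linked = Counter(p["alert"] for p in pages if p["id"] in incident_page_ids)
    ranked = sorted(
        ((name, count, linked[name] / count) for name, count in fired.items()),
        # Lowest incident share first; on ties, the noisiest alert first.
        key=lambda row: (row[2], -row[1]),
    )
    return ranked[: len(ranked) // 2]


# Illustrative export: 11 pages across four alerts, three of them tied to incidents.
pages = (
    [{"id": f"cpu-{i}", "alert": "HighCPU"} for i in range(5)]
    + [{"id": f"disk-{i}", "alert": "DiskFull"} for i in range(3)]
    + [{"id": f"chk-{i}", "alert": "CheckoutErrors"} for i in range(2)]
    + [{"id": "db-0", "alert": "DBDown"}]
)
incident_page_ids = {"chk-0", "chk-1", "db-0"}

pile = delete_candidates(pages, incident_page_ids)
for name, count, share in pile:
    print(f"{name}: {count} pages, {share:.0%} linked to an incident")
```

Here HighCPU and DiskFull land in the delete pile because none of their pages matched an incident, while the two alerts that always matched one stay.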