Alert Fatigue: The Real Cost
Alert fatigue costs more than 'tired engineers.' The real cost: missed real incidents, attrition, eroded trust.
What alert fatigue actually costs
Lost detection. Engineers ignore the third PagerDuty page in an hour. The fourth page is the real outage and it sits unacknowledged for 18 minutes while the on-call clears noise.
Attrition. SREs cite a noisy rotation as the top reason they leave. Replacing a senior on-call costs 6 to 9 months of ramp time plus recruiter fees north of 25k.
Eroded trust. Once a team learns alerts lie, they stop investigating any of them. The next genuine signal gets the same shrug as the noise.
How to measure it
Track alerts-per-on-call-shift, ack-time distributions, and the auto-resolved ratio. If more than 60% of pages auto-resolve before human action, the threshold is wrong, not the system.
Survey on-call after each rotation. Ask one question: how many pages were actionable. Below 70% means the rotation is broken.
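Watch Alertmanager's own metrics for silenced and inhibited alerts, for example the alertmanager_silences gauge with state="active" and the alertmanager_alerts gauge with state="suppressed". Steady growth in silences is a leading indicator of fatigue.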
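The measurements above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the Page record and its field names are hypothetical, and "human action" is approximated by the acknowledgement timestamp, which is an assumption about how your paging tool exports data.

```python
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Page:
    # Hypothetical export format; adapt the fields to your paging tool.
    alert: str
    fired_at: datetime
    acked_at: Optional[datetime]  # None if nobody acknowledged the page
    resolved_at: datetime
    shift_id: str


def auto_resolved_ratio(pages: List[Page]) -> float:
    """Fraction of pages that resolved with no ack, or before the ack."""
    if not pages:
        return 0.0
    auto = sum(
        1 for p in pages
        if p.acked_at is None or p.resolved_at <= p.acked_at
    )
    return auto / len(pages)


def pages_per_shift(pages: List[Page]) -> Dict[str, int]:
    """Count of pages for each on-call shift."""
    counts: Dict[str, int] = {}
    for p in pages:
        counts[p.shift_id] = counts.get(p.shift_id, 0) + 1
    return counts


def ack_time_percentile(pages: List[Page], pct: float) -> Optional[float]:
    """Nearest-rank percentile of time-to-ack in seconds (acked pages only)."""
    times = sorted(
        (p.acked_at - p.fired_at).total_seconds()
        for p in pages
        if p.acked_at is not None
    )
    if not times:
        return None
    rank = max(1, math.ceil(pct / 100 * len(times)))
    return times[rank - 1]
```

Comparing auto_resolved_ratio against the 60% line, and looking at the 50th and 95th percentile of ack time rather than the mean, keeps a handful of slow acknowledgements from hiding in an average.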
What actually reduces noise
Delete alerts that fired in the last 90 days but produced no ticket and no remediation. Default to deletion, not muting.
Replace flap-prone CPU and memory alerts with SLO burn-rate alerts. A 14.4x burn rate sustained over a 1-hour window, which consumes 2% of a 30-day error budget, is worth waking someone for; a 30-second 90% CPU spike is not.
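The arithmetic behind a burn-rate threshold is short enough to write down. The sketch below assumes a 30-day SLO period and a request-based error ratio; the function names are illustrative, not from any library.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    error_ratio: observed fraction of failed requests in the window.
    slo_target:  e.g. 0.999 for a 99.9% availability SLO.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent at this rate over the window."""
    return rate * window_hours / period_hours


# A 99.9% SLO with 1.44% of requests failing burns budget at roughly 14.4x.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)

# Held for one hour that spends about 2% of a 30-day budget;
# held for two hours, about 4%.
one_hour = budget_consumed(14.4, window_hours=1)
two_hours = budget_consumed(14.4, window_hours=2)
```

A CPU spike has no equivalent calculation: it says nothing about how much of the budget users have actually lost, which is why it makes a poor paging signal.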
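Group on root cause. Alertmanager's group_by on the service label, combined with a group_wait long enough to catch the burst, collapses 40 simultaneous pod-down alerts into one page. Keep in mind that group_wait delays the first notification for every new group, so a 5-minute value adds up to five minutes to each page. Alertmanager's defaults are a 30-second group_wait and a 5-minute group_interval.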
The alert budget
Set a hard cap. 2 pages per on-call shift on average, 5 maximum in any 24-hour window. If a new alert pushes the team over, an old alert gets retired first.
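A budget like this is easy to check mechanically. The sketch below assumes you can export page timestamps and know how many shifts the export covers; the caps default to the numbers above, and the function names are illustrative.

```python
from datetime import datetime, timedelta
from typing import List


def max_pages_in_window(page_times: List[datetime],
                        window: timedelta = timedelta(hours=24)) -> int:
    """Largest number of pages that fall inside any sliding window."""
    times = sorted(page_times)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans less than `window`.
        while t - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def within_budget(page_times: List[datetime], shift_count: int,
                  avg_cap: float = 2.0, window_cap: int = 5) -> bool:
    """True if both the per-shift average and the 24-hour cap are respected."""
    if shift_count <= 0:
        raise ValueError("shift_count must be positive")
    average = len(page_times) / shift_count
    return average <= avg_cap and max_pages_in_window(page_times) <= window_cap
```

Running a check like this against the pages a proposed alert would have generated historically gives the "retire an old alert first" rule something concrete to enforce.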
Treat alerts like code. Every new rule needs a runbook URL, an owner team, and an expiry date that forces re-justification.
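If alerts are treated like code, the three requirements can be linted like code. This is a minimal sketch with a hypothetical AlertRule record; in practice the fields would more likely be read from rule labels or annotations, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AlertRule:
    # Hypothetical shape; map these from your rule files' metadata.
    name: str
    runbook_url: Optional[str] = None
    owner_team: Optional[str] = None
    expires_on: Optional[date] = None


def lint_rule(rule: AlertRule, today: date) -> List[str]:
    """Return a list of policy violations; an empty list means the rule passes."""
    problems = []
    if not rule.runbook_url:
        problems.append("missing runbook URL")
    if not rule.owner_team:
        problems.append("missing owner team")
    if rule.expires_on is None:
        problems.append("missing expiry date")
    elif rule.expires_on <= today:
        problems.append("expired, needs re-justification")
    return problems


# Example: a rule with no owner and a lapsed expiry date fails two checks.
stale = AlertRule(
    name="DiskFull",
    runbook_url="https://runbooks.example.com/disk-full",  # placeholder URL
    expires_on=date(2024, 1, 1),
)
issues = lint_rule(stale, today=date(2024, 6, 1))
```

Run in CI, a check like this blocks a rule from merging without a runbook or owner and makes expired rules fail loudly instead of lingering.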
Review the alert backlog quarterly. Anything firing more than 10 times per month without a ticket is auto-flagged for deletion.
Where to start this week
Pull the last 30 days of pages. Group them by alert name, count occurrences, and join against the incident tracker. Rank the alerts by the share of pages that led to an incident; the bottom half of that list is your delete pile.
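One way to run this triage is sketched below. It reads the "bottom half" as the alerts with the lowest share of pages linked to an incident, which is an interpretation; the input shapes (page id and alert name pairs, plus a set of page ids found in the incident tracker) are also assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def delete_pile(pages: Iterable[Tuple[int, str]],
                incident_page_ids: Set[int]) -> List[str]:
    """Return the lower half of alerts, ranked by how often they mattered.

    pages:             (page_id, alert_name) for every page in the period.
    incident_page_ids: page ids that were linked to an incident or ticket.
    """
    total: Counter = Counter()
    useful: Counter = Counter()
    for page_id, alert in pages:
        total[alert] += 1
        if page_id in incident_page_ids:
            useful[alert] += 1

    # Most useful first; among equally useful alerts, the noisiest sort last.
    ranked = sorted(
        total,
        key=lambda a: (-useful[a] / total[a], total[a], a),
    )
    # With an odd number of alerts the middle one is kept, not deleted.
    return ranked[(len(ranked) + 1) // 2:]


pages_30d = [
    (1, "db_down"), (2, "db_down"),
    (3, "cpu_high"), (4, "cpu_high"), (5, "cpu_high"),
    (6, "disk_full"),
    (7, "pod_restart"), (8, "pod_restart"),
]
candidates = delete_pile(pages_30d, incident_page_ids={1, 2, 6})
```

Treat the result as a review queue rather than an automatic deletion list: an alert that rarely fires but guards a severe failure mode may deserve to stay.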
Add an explicit pager budget to the SRE roadmap. Make it a leadership commitment, not a wishlist item.
Skip vendor noise-reduction features until the rule list is clean. Deduping bad rules just buries them deeper.