The Alert Cleanup Discipline
Alerts accumulate. This is the cleanup routine that prevents alert sprawl.
Why alerts accumulate
Each new feature ships with a few alerts, and nobody removes the old ones. After 2 years a team can own 200 alerts and act on only 30. On-call pays the cost: noisy alerts mask real ones, train responders to ignore pages, and burn out rotations. Cleanup is not optional. Alerts are added with every feature and removed only by deliberate pruning, so without scheduled pruning alert sprawl is mathematically guaranteed.
- Per-feature alert addition. Each new feature adds alerts; nobody removes the old ones.
- 200 alerts, 30 useful. The typical state after 2 years; the dilution kills signal-to-noise.
- Cost paid by on-call. Noise masks real alerts, trains responders to ignore pages, burns out rotations.
- Sprawl is mathematically guaranteed. Without scheduled pruning, the catalog only grows.
The quarterly cleanup
Every quarter, each team reviews every alert it owns and gives each one a verdict: keep, retune, or retire. Retire any alert that fired zero times last quarter and is not a safety net. Retire any alert that fired but never led to action. Block the cleanup ritual on the on-call calendar, because if it is not scheduled, it does not happen.
- Quarterly cadence. Each team reviews all alerts they own; the recurring discipline.
- Three decisions. Keep, retune, retire; per-alert verdict.
- Zero-fire retirement. Fired zero times in 90 days and not a safety net: retire.
- Calendar block. If not scheduled, it does not happen; the discipline lives in the calendar.
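The per-alert verdict above can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the `AlertStats` fields are hypothetical names, and the "retune" rule (some fires led to action, others did not) is an assumption, since the text does not say when to retune rather than keep.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical per-alert summary for one quarter."""
    name: str
    fires: int                 # times the alert fired last quarter
    actions: int               # fires that led to an action being taken
    safety_net: bool = False   # deliberately kept even if it never fires


def quarterly_verdict(alert: AlertStats) -> str:
    """Return 'keep', 'retune', or 'retire' for one alert."""
    if alert.fires == 0:
        # Silent all quarter: retire unless it is a deliberate safety net.
        return "keep" if alert.safety_net else "retire"
    if alert.actions == 0:
        # Fired, but nobody acted on it.
        return "retire"
    if alert.actions < alert.fires:
        # Assumed rule: a mix of actionable and ignored fires suggests retuning.
        return "retune"
    return "keep"
```

For example, `quarterly_verdict(AlertStats("disk_full", fires=12, actions=3))` returns `"retune"` under these assumptions.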
Metrics that drive the cleanup
Three metrics drive cleanup decisions: the per-alert fire count over the last 90 days, the per-alert action-taken rate (was an incident opened?), and the per-alert noise score (fires per real incident). An alert with a noise score above 5, meaning more than 5 fires per real incident, is a candidate for retirement or major retuning. Surface the metrics in a dashboard the team sees weekly so that cleanup becomes default behaviour.
- Fire count last 90 days. The basic activity metric.
- Action-taken rate. Was an incident opened; the value indicator.
- Noise score > 5. More than 5 fires per real incident; retirement or retuning candidate.
- Weekly dashboard surface. Cleanup becomes default behaviour, not a special event.
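The three metrics reduce to simple arithmetic on two counts per alert. The sketch below assumes you can already obtain, for each alert, the number of fires and the number of incidents opened over the last 90 days. The function names are illustrative, and treating an alert that fired with zero incidents as infinitely noisy is an assumption.

```python
from typing import Optional

NOISE_THRESHOLD = 5.0  # from the text: more than 5 fires per real incident


def action_taken_rate(fires: int, incidents: int) -> Optional[float]:
    """Fraction of fires that led to an incident; None if the alert never fired."""
    if fires == 0:
        return None
    return incidents / fires


def noise_score(fires: int, incidents: int) -> float:
    """Fires per real incident over the reporting window."""
    if incidents == 0:
        # Assumption: fires with no incident at all count as unbounded noise.
        return float("inf") if fires > 0 else 0.0
    return fires / incidents


def is_cleanup_candidate(fires: int, incidents: int) -> bool:
    """True when the alert should be considered for retirement or retuning."""
    return noise_score(fires, incidents) > NOISE_THRESHOLD
```

An alert that fired 12 times for 2 incidents scores 6.0 and is flagged. One that fired 10 times for 2 incidents scores exactly 5.0 and is not.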
The political problem
Engineers are nervous about retiring alerts they wrote. “What if it catches something next quarter?” The counter: every alert has a cost paid in pages, and an alert that catches one issue per year but pages 100 times is a net loss. Make retirement reversible by moving the alert config to a retired/ folder in git rather than deleting it.
- Engineer nervousness. “What if it catches something?” is the recurring concern.
- Cost in pages. 100 pages for one issue per year is a net loss; the math is clear.
- Reversible retirement. Move to a retired/ folder in git; not deleted, recoverable.
- Per-retirement audit. The retirement record persists; supports later restoration if needed.
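Reversible retirement can be as small as a file move plus a log line. The sketch below is one possible shape, assuming alert configs live one per file under an `alerts/` directory in the repository. The `alerts/` layout, the `RETIRED.log` file, and both function names are hypothetical. It only moves files on disk, so the move still needs to be committed (or done with `git mv`) for git to record it.

```python
import json
import shutil
from datetime import date
from pathlib import Path


def retire_alert(repo: Path, alert_file: str, reason: str) -> Path:
    """Move an alert config into retired/ and append an audit record."""
    src = repo / "alerts" / alert_file      # assumed layout: one file per alert
    dest_dir = repo / "retired"
    dest_dir.mkdir(exist_ok=True)
    dest = dest_dir / alert_file
    shutil.move(str(src), str(dest))

    # One JSON line per retirement, so the record survives later restores.
    record = {
        "alert": alert_file,
        "retired_on": date.today().isoformat(),
        "reason": reason,
    }
    with (dest_dir / "RETIRED.log").open("a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return dest


def restore_alert(repo: Path, alert_file: str) -> Path:
    """Undo a retirement by moving the config back into alerts/."""
    dest = repo / "alerts" / alert_file
    shutil.move(str(repo / "retired" / alert_file), str(dest))
    return dest
```

Keeping the audit log next to the retired configs means a later reviewer can see why an alert was retired before deciding to restore it.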
How to start cleanup this quarter
The starter ramp is concrete. Pull a report of all alerts with zero fires in 90 days and retire every one that is not a safety net; this usually cuts about 30% of the catalog. Then look at the top 10 noisiest alerts and retune or retire each. Repeat next quarter, because the steady-state alert count should be flat or decreasing.
- Zero-fire 90-day retirement. Usually cuts about 30% of the catalog in one pass; the easy first win.
- Top 10 noisiest. Retune or retire each; the highest-noise tail.
- Quarterly repeat. Steady-state should be flat or decreasing; growth means sprawl.
- Per-quarter delta tracked. Documented retirement count; supports continued investment.
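The starter report can be sketched in a few lines, given per-alert counts for the last 90 days. The dictionary keys (`name`, `fires`, `incidents`, `safety_net`) are assumed for illustration, and ranking "noisiest" by fires per incident rather than by raw fire count is one reasonable reading of the text, not the only one.

```python
def starter_report(alerts, top_n=10):
    """Split a catalog into zero-fire retirements and the noisiest alerts.

    `alerts` is a list of dicts with the assumed keys 'name', 'fires',
    'incidents', and optionally 'safety_net'.
    """
    # Zero fires in the window and not a safety net: retire.
    zero_fire = [
        a["name"]
        for a in alerts
        if a["fires"] == 0 and not a.get("safety_net", False)
    ]

    def score(a):
        # Fires per real incident; no incidents at all ranks as noisiest.
        return a["fires"] / a["incidents"] if a["incidents"] else float("inf")

    fired = [a for a in alerts if a["fires"] > 0]
    noisiest = sorted(fired, key=score, reverse=True)[:top_n]
    return {
        "retire_zero_fire": zero_fire,
        "review_noisiest": [a["name"] for a in noisiest],
    }


def quarter_delta(count_start: int, count_end: int) -> int:
    """Change in catalog size over the quarter; positive means sprawl."""
    return count_end - count_start
```

Recording `quarter_delta` each quarter gives the documented retirement trend the last bullet calls for: a value at or below zero means the catalog is flat or shrinking.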