The Alert Cleanup Discipline
Alerts accumulate. A scheduled cleanup is what prevents alert sprawl.
Why alerts accumulate
Each new feature ships with a few alerts. Nobody removes the old ones. After 2 years a team owns 200 alerts and uses 30.
The cost is paid by on-call. Noisy alerts mask real ones, train responders to ignore pages, and burn out rotations.
Cleanup is not optional. Without scheduled pruning, the alert count only grows: every feature adds alerts and nothing removes them.
The quarterly cleanup
Every quarter, each team reviews all alerts they own. Decision per alert: keep, retune, retire.
Retire any alert that fired zero times last quarter and is not a safety net. Retire any alert that fired but had no action taken.
Block the cleanup ritual on the on-call calendar. If it is not scheduled, it does not happen.
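The keep, retune, retire decision can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the AlertStats type and its field names are hypothetical, and the rule that sends partially actioned alerts to "retune" is an assumption, since the rules above only spell out when to retire.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Hypothetical per-alert summary for one quarter."""
    name: str
    fires: int                  # times the alert fired last quarter
    actions_taken: int          # fires that led to any action
    is_safety_net: bool = False


def triage(alert: AlertStats) -> str:
    """Return 'keep', 'retune', or 'retire' for one alert."""
    if alert.fires == 0:
        # Silent alerts are retired unless they are deliberate safety nets.
        return "keep" if alert.is_safety_net else "retire"
    if alert.actions_taken == 0:
        # Fired, but nobody ever acted on it.
        return "retire"
    if alert.actions_taken < alert.fires:
        # Assumption: some fires were ignored, so thresholds need work.
        return "retune"
    return "keep"
```

For example, triage(AlertStats("cpu_high", fires=40, actions_taken=0)) returns "retire", while a silent alert flagged as a safety net is kept.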
Metrics that drive the cleanup
Track three numbers per alert: fire count over the last 90 days, action-taken rate (was an incident opened?), and noise score (fires per real incident).
An alert with a noise score above 5 (more than 5 fires for every real incident) is a candidate for retirement or major retuning.
Surface the metrics in a dashboard the team sees weekly. Cleanup becomes a default behaviour, not a special event.
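As a rough sketch of how the metrics above could be computed, assuming the fire count and the number of real incidents per alert are already available from the alerting and incident systems. The function names and the handling of alerts with zero incidents are assumptions made for illustration.

```python
NOISE_THRESHOLD = 5.0  # from the rule above: more than 5 fires per real incident


def noise_score(fires: int, incidents: int) -> float:
    """Fires per real incident over the review window."""
    if incidents == 0:
        # Assumption: an alert that fired without any incident is maximally noisy;
        # an alert that never fired has no noise at all.
        return float("inf") if fires > 0 else 0.0
    return fires / incidents


def action_taken_rate(fires: int, incidents: int) -> float:
    """Fraction of fires that led to an incident being opened."""
    return incidents / fires if fires else 0.0


def is_cleanup_candidate(fires: int, incidents: int) -> bool:
    """True when the alert should be considered for retirement or retuning."""
    return noise_score(fires, incidents) > NOISE_THRESHOLD
```

An alert that fired 50 times for 5 incidents has a noise score of 10 and is a candidate. One that fired 10 times for 5 incidents scores 2 and is not.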
The political problem
Engineers are nervous about retiring alerts they wrote. "What if it catches something next quarter?"
Counter: every alert has a cost, paid in pages. An alert that catches one issue per year but pages 100 times is a net loss.
Make retirement reversible. Move the alert config to a retired/ folder in git instead of deleting it, so it can be brought back if needed.
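One way to make retirement a move rather than a delete is sketched below. It assumes one config file per alert and a retired/ folder at the repository root; the function names and layout are hypothetical. The move itself is a plain file move, so it still has to be committed to git afterwards.

```python
from pathlib import Path
import shutil


def retire_alert(config_path: Path, repo_root: Path) -> Path:
    """Move an alert config into <repo_root>/retired/ and return its new path."""
    retired_dir = repo_root / "retired"
    retired_dir.mkdir(exist_ok=True)
    target = retired_dir / config_path.name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(config_path), str(target))
    return target


def restore_alert(retired_path: Path, active_dir: Path) -> Path:
    """Bring a retired alert config back into the active alerts folder."""
    target = active_dir / retired_path.name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(retired_path), str(target))
    return target
```

Because the file keeps its name and content, git history shows the retirement as a rename, and restoring the alert is the same move in reverse.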
How to start cleanup this quarter
Pull a report of all alerts with zero fires in 90 days. Retire all of them except deliberate safety nets. This alone usually cuts about 30% of the catalog.
Then look at the top 10 noisiest alerts. Retune or retire each.
Repeat next quarter. Steady-state alert count should be flat or decreasing.
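The first two steps can be pulled from the same data. The sketch below assumes a mapping from alert name to a (fires, incidents) pair for the last 90 days; that input shape and the function name are assumptions, and the zero-fire list still needs a manual check for safety nets before anything is retired.

```python
def starter_report(stats: dict[str, tuple[int, int]], top_n: int = 10):
    """Return (zero_fire_alerts, noisiest_alerts) for a first cleanup pass.

    stats maps alert name -> (fires, incidents) over the last 90 days.
    """
    # Step 1: alerts that never fired in the window.
    zero_fire = sorted(name for name, (fires, _) in stats.items() if fires == 0)

    # Step 2: rank alerts that did fire by fires per real incident.
    def sort_key(item):
        _, (fires, incidents) = item
        score = fires / incidents if incidents else float("inf")
        return (score, fires)  # break ties by raw fire count

    fired = [(name, counts) for name, counts in stats.items() if counts[0] > 0]
    ranked = sorted(fired, key=sort_key, reverse=True)
    noisiest = [name for name, _ in ranked[:top_n]]
    return zero_fire, noisiest
```

Running the same report each quarter also gives the trend line: the combined length of the two lists should shrink as the catalog settles.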