The Alert Cleanup Discipline

Alerts accumulate, and scheduled cleanup is what prevents alert sprawl.

Why alerts accumulate

Each new feature ships with a few alerts, and nobody removes the old ones. After two years a team owns 200 alerts and relies on 30. On-call pays the cost: noisy alerts mask real ones, train responders to ignore pages, and burn out rotations. Cleanup is not optional. Every feature adds alerts and nothing removes them by default, so without scheduled pruning the catalog only grows.

The quarterly cleanup

Every quarter, each team reviews every alert it owns and makes one decision per alert: keep, retune, or retire. Retire any alert that fired zero times last quarter and is not a deliberate safety net. Retire any alert that fired but never led to any action. Block time for the review on the on-call calendar, because if it is not scheduled, it does not happen.
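The per-alert decision can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the AlertStats fields are hypothetical names, and the "retune" condition (some fires led to action, some did not) is an assumption, since the rules above only spell out when to retire.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    # All field names are hypothetical; map them to whatever your tooling exports.
    name: str
    fires_last_quarter: int
    actions_taken: int          # fires that led to some action
    is_safety_net: bool = False


def review_decision(alert: AlertStats) -> str:
    """Return 'keep', 'retune', or 'retire' for one alert."""
    if alert.fires_last_quarter == 0:
        # Silent alerts are retired unless they are a deliberate safety net.
        return "keep" if alert.is_safety_net else "retire"
    if alert.actions_taken == 0:
        # It fired, but nobody ever had to act on it.
        return "retire"
    if alert.actions_taken < alert.fires_last_quarter:
        # Assumption: a mix of actionable and non-actionable fires means retune.
        return "retune"
    return "keep"
```

For example, review_decision(AlertStats("disk_full", 5, 0)) returns "retire", while the same alert with is_safety_net=True and zero fires is kept.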

Metrics that drive the cleanup

Three metrics drive cleanup decisions: the per-alert fire count over the last 90 days; the per-alert action-taken rate (the share of fires for which an incident was opened); and the per-alert noise score (fires per real incident). An alert with a noise score above 5, meaning more than 5 fires per real incident, is a candidate for retirement or major retuning. Surface the metrics on a dashboard the team sees weekly so cleanup becomes default behaviour.
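As a rough sketch, the three metrics can be derived from two counts per alert: fires in the last 90 days and incidents opened from those fires. The function and field names here are hypothetical, and treating an alert that fired but never produced an incident as a candidate is an assumption (its noise score is undefined rather than a number).

```python
from dataclasses import dataclass
from typing import Optional

NOISE_THRESHOLD = 5.0  # more than 5 fires per real incident


@dataclass
class AlertMetrics:
    fire_count: int
    action_rate: float            # incidents opened / fires
    noise_score: Optional[float]  # fires / incidents; None if no incidents


def compute_metrics(fires: int, incidents: int) -> AlertMetrics:
    """Compute the three cleanup metrics for one alert over a 90-day window."""
    action_rate = incidents / fires if fires else 0.0
    noise_score = fires / incidents if incidents else None
    return AlertMetrics(fires, action_rate, noise_score)


def is_cleanup_candidate(m: AlertMetrics) -> bool:
    """True if the alert looks noisy enough to retune or retire."""
    if m.fire_count == 0:
        # Zero-fire alerts are handled by a separate rule, not by noise score.
        return False
    if m.noise_score is None:
        # Assumption: fired but never led to an incident counts as pure noise.
        return True
    return m.noise_score > NOISE_THRESHOLD
```

An alert with 40 fires and 4 incidents has a noise score of 10 and is flagged; one with 10 fires and 5 incidents has a score of 2 and is not.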

The political problem

Engineers are nervous about retiring alerts they wrote: “What if it catches something next quarter?” The counter is that every alert has a cost paid in pages, and an alert that catches one issue per year but pages 100 times is a net loss. Make retirement reversible by moving the alert config to a retired/ folder in git rather than deleting it, so restoring the alert is a one-line change.
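A reversible retirement can be as simple as a file move. The sketch below assumes one config file per alert and a layout with alerts/ and retired/ directories at the repository root; both the layout and the function names are hypothetical. It only moves the file, so the commit itself is still done with your usual git workflow (running git mv by hand achieves the same thing).

```python
from pathlib import Path


def retire_alert(repo_root: Path, alert_file: str) -> Path:
    """Move alerts/<alert_file> to retired/<alert_file> and return the new path."""
    src = repo_root / "alerts" / alert_file
    dest_dir = repo_root / "retired"
    dest_dir.mkdir(exist_ok=True)
    dest = dest_dir / alert_file
    # Note: on Unix an existing file at dest is silently replaced.
    src.rename(dest)
    return dest


def restore_alert(repo_root: Path, alert_file: str) -> Path:
    """Undo a retirement by moving the config back into alerts/."""
    src = repo_root / "retired" / alert_file
    dest = repo_root / "alerts" / alert_file
    src.rename(dest)
    return dest
```

Because the config never leaves the repository, the history of the rule and the reason for retiring it stay in the commit log.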

How to start cleanup this quarter

Getting started is concrete. First, pull a report of all alerts with zero fires in the last 90 days and retire them, apart from deliberate safety nets; this usually cuts 30% of the catalog. Then look at the 10 noisiest alerts and retune or retire each. Repeat next quarter: in steady state, the alert count should be flat or decreasing.
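The starter report can be sketched as one function over per-alert fire counts. The input shape (a mapping from alert name to fires in the last 90 days) and the function name are assumptions, and "noisiest" is approximated here by raw fire count; a noise score could be substituted if incident data is available.

```python
from typing import AbstractSet, Dict, List, Tuple


def starter_report(
    fires_90d: Dict[str, int],
    safety_nets: AbstractSet[str] = frozenset(),
    top_n: int = 10,
) -> Tuple[List[str], List[str]]:
    """Return (zero-fire alerts to retire, noisiest alerts to review)."""
    # Alerts that never fired, excluding deliberate safety nets.
    zero_fire = sorted(
        name for name, fires in fires_90d.items()
        if fires == 0 and name not in safety_nets
    )
    # Alerts that did fire, highest count first; ties broken by name.
    firing = [(name, fires) for name, fires in fires_90d.items() if fires > 0]
    firing.sort(key=lambda item: (-item[1], item[0]))
    noisiest = [name for name, _ in firing[:top_n]]
    return zero_fire, noisiest
```

Running the same report each quarter and comparing the total number of alerts gives a simple check that the count is flat or decreasing.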