The Alert Decay Pattern
Old alerts stop reflecting the systems they monitor. Flag and retire them automatically.
What alert decay means
Alerts have a half-life. The system they monitor changes faster than the alert config; after 6-12 months, most alerts no longer reflect reality. Decay shows up as alerts firing on retired metrics, alerts pointing to teams that no longer own the service, runbooks that 404. Treat alert age like dependency age: old is suspicious, old and untouched is broken.
- Alerts have a half-life. System changes faster than alert config; reality drifts.
- 6-12 month decay window. Most alerts no longer reflect reality after that.
- Decay symptoms. Retired metrics, wrong owner teams, 404’d runbooks.
- Treat age like dependency age. Old is suspicious; old and untouched is broken.
Auto-flag stale alerts
Auto-flagging makes the decay visible. Tag every alert with its last-edit date and last-fire date. If both are more than 12 months old, auto-tag the alert as stale. Stale alerts generate a quarterly email to the owning team, and three quarters without action trigger auto-archive. This forces the team to either confirm the alert is still wanted or accept its retirement.
- Last-edit and last-fire labels. Both over 12 months: auto-tag as stale.
- Quarterly email to owning team. Stale alerts surface to the team that owns them.
- Three quarters without action trigger archive. Default to retirement; the team must defend.
- Per-alert staleness flag. Visible in the alert catalog; supports continued curation.
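The flagging rule above can be sketched in a few lines. This is a minimal illustration, not any tool's API: the `Alert` record, its field names, and the `unanswered_reminders` counter are hypothetical, "12 months" is approximated as 365 days, and treating a never-fired alert as old on the fire side is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=365)   # "over 12 months", approximated as 365 days
ARCHIVE_AFTER_REMINDERS = 3         # three quarterly emails with no action


@dataclass
class Alert:
    """Hypothetical catalog record; field names are illustrative."""
    name: str
    owner: str
    last_edit: date
    last_fire: Optional[date]       # None means the alert has never fired
    unanswered_reminders: int = 0   # quarterly emails sent without any action


def is_stale(alert: Alert, today: date) -> bool:
    """Stale only when BOTH last edit and last fire are past the threshold."""
    edit_old = today - alert.last_edit > STALE_AFTER
    # Assumption: an alert that has never fired counts as old on the fire side.
    fire_old = alert.last_fire is None or today - alert.last_fire > STALE_AFTER
    return edit_old and fire_old


def next_action(alert: Alert, today: date) -> str:
    """Return 'keep', 'remind', or 'archive' for one alert."""
    if not is_stale(alert, today):
        return "keep"
    if alert.unanswered_reminders >= ARCHIVE_AFTER_REMINDERS:
        return "archive"
    return "remind"
```

Editing the alert resets `last_edit`, so any deliberate touch by the owning team takes the alert out of the stale set. That is the "confirm it is still wanted" path.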
When recency lies
A recent fire does not prove health. An alert that fired last week might be flapping on a deprecated dependency. Cross-check three things: the runbook still resolves, the linked dashboard loads, and the owning team still exists. Add an annual deep review on top of the staleness flag: the flag catches the obvious cases and the review catches the subtle ones.
- Recent fire is not health. Might be flapping on a deprecated dependency.
- Cross-check three things. Runbook resolves, dashboard loads, owning team exists.
- Annual deep review. Flag catches obvious; review catches subtle.
- Per-alert annual review. Even active alerts get checked; supports correctness.
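The three cross-checks are easy to script for the annual review. The sketch below assumes a hypothetical `AlertLinks` record and takes the actual checks as injected functions, because how you verify a dashboard or look up a team depends on your tooling. `http_ok` is one possible URL check using only the standard library; some servers reject HEAD requests, so treat a failure as a prompt to look, not as proof.

```python
import urllib.request
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertLinks:
    """Hypothetical record of what an alert points at."""
    name: str
    runbook_url: str
    dashboard_url: str
    owner_team: str


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to the URL succeeds (2xx or 3xx)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # Covers HTTP errors such as 404, DNS failures, timeouts, bad URLs.
        return False


def review_alert(
    alert: AlertLinks,
    url_ok: Callable[[str], bool],
    team_exists: Callable[[str], bool],
) -> List[str]:
    """Return a list of problems; an empty list means all three checks pass."""
    problems = []
    if not url_ok(alert.runbook_url):
        problems.append("runbook does not resolve")
    if not url_ok(alert.dashboard_url):
        problems.append("dashboard does not load")
    if not team_exists(alert.owner_team):
        problems.append("owning team no longer exists")
    return problems
```

In practice you would call `review_alert(alert, http_ok, team_exists)` with `team_exists` backed by whatever directory holds your team list. Run it over every alert, including ones that fired last week.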
Auto-archive process
Archive is not delete. The alert config moves to an /archive folder in git with a tombstone reason. Bringing it back takes one PR if the alert is needed again, so archiving is cheap to do and cheap to undo. After 24 months in the archive, hard-delete: anything that was needed would have been restored by then.
- Archive folder in git. Move to /archive with a tombstone reason; not a hard delete.
- One-PR restore. Cheap to undo if the alert is needed again.
- 24-month hard delete. Anything needed would have been restored by then.
- Per-alert tombstone. Documented archive reason; supports later forensics.
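A rough sketch of the archive step, assuming alert configs live as files in a repository checkout. The `archive/` layout, the `.tombstone.json` sidecar file, and both function names are illustrative choices, and "24 months" is approximated as 730 days. In a real repository you would likely use `git mv` and commit the result through a PR; plain file moves are used here to keep the example self-contained.

```python
import json
import shutil
from datetime import date
from pathlib import Path
from typing import List

HARD_DELETE_AFTER_DAYS = 730  # "24 months", approximated


def archive_alert(repo: Path, rel_path: str, reason: str, today: date) -> Path:
    """Move an alert config under archive/ and write a tombstone beside it."""
    src = repo / rel_path
    dest = repo / "archive" / rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    tombstone = dest.with_name(dest.name + ".tombstone.json")
    tombstone.write_text(json.dumps({
        "reason": reason,
        "archived_on": today.isoformat(),
        "original_path": rel_path,   # where a restore should put it back
    }, indent=2))
    return dest


def due_for_hard_delete(repo: Path, today: date) -> List[Path]:
    """List archived configs whose tombstone is older than the cutoff."""
    due = []
    for tomb in sorted((repo / "archive").rglob("*.tombstone.json")):
        meta = json.loads(tomb.read_text())
        age = today - date.fromisoformat(meta["archived_on"])
        if age.days > HARD_DELETE_AFTER_DAYS:
            config_name = tomb.name[: -len(".tombstone.json")]
            due.append(tomb.with_name(config_name))
    return due
```

Restoring is the reverse move back to `original_path`, which is what keeps the bring-back to a single PR. The tombstone stays in git history either way.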
How to roll out decay
The rollout is concrete. Add last-edit and last-fire labels to your alerting tool (Datadog, PagerDuty, Prometheus all support custom labels or tags). Build the staleness report once and schedule it weekly into a Slack channel. Resist the urge to argue per-alert: the point is to default to retirement, and exceptions need owners willing to defend them.
- Add last-edit and last-fire labels. Datadog, PagerDuty, Prometheus all support custom labels or tags.
- Weekly Slack staleness report. Built once; scheduled weekly; the visibility is the lever.
- Default to retirement. Exceptions need owners willing to defend; the discipline lives here.
- Per-quarter retirement count. Documented retirements per cycle; supports continued cleanup.
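The weekly report itself can be very small. Below is a minimal sketch that only builds the message text; the function name, the `(owner, alert_name)` input shape, and the wording of the message are assumptions. Delivery is left out: one common route is a Slack incoming webhook, which accepts a JSON body with a `text` field.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_staleness_report(
    stale_alerts: Iterable[Tuple[str, str]],
    retired_this_quarter: int,
) -> str:
    """Format stale alerts, grouped by owning team, as a plain-text message.

    stale_alerts is an iterable of (owner, alert_name) pairs.
    """
    by_owner: Dict[str, List[str]] = defaultdict(list)
    for owner, alert_name in stale_alerts:
        by_owner[owner].append(alert_name)

    total = sum(len(names) for names in by_owner.values())
    lines = [f"Stale alerts this week: {total}"]
    for owner in sorted(by_owner):
        names = ", ".join(sorted(by_owner[owner]))
        lines.append(f"- {owner} ({len(by_owner[owner])}): {names}")
    lines.append(f"Retired this quarter: {retired_this_quarter}")
    return "\n".join(lines)


report = build_staleness_report(
    [("payments", "HighLatency"), ("search", "IndexLag"), ("payments", "QueueDepth")],
    retired_this_quarter=4,
)
print(report)
```

Grouping by owner keeps the conversation at the team level, not alert by alert, and the retirement count at the bottom makes each cycle's cleanup visible.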