The Alert Decay Pattern

Old alerts stop mattering long before anyone removes them. Auto-decay them.

What alert decay means

Alerts have a half-life. The system they monitor changes faster than the alert config, so after 6-12 months most alerts no longer reflect reality. Decay shows up as alerts firing on retired metrics, alerts pointing to teams that no longer own the service, and runbooks that 404. Treat alert age like dependency age: old is suspicious, old and untouched is broken.

Auto-flag stale alerts

Auto-flagging makes the decay visible. Tag every alert with its last-edit date and last-fire date. If both are more than 12 months old, tag the alert as stale automatically. Each stale alert triggers a quarterly email to the owning team, and three quarters without action trigger auto-archive. This forces the team to either confirm the alert is still wanted or accept its retirement.
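The flagging rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Alert record, its field names, and the ignored_reminders counter are hypothetical, "12 months" is approximated as 365 days, and treating a never-fired alert as old on the fire side is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STALE_AFTER = timedelta(days=365)  # "over 12 months", approximated as 365 days
MAX_IGNORED_REMINDERS = 3          # three quarterly emails without action


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    name: str
    owner: str
    last_edit: date
    last_fire: Optional[date]      # None means the alert has never fired
    ignored_reminders: int = 0     # quarterly emails sent with no response


def is_stale(alert: Alert, today: date) -> bool:
    """Stale only if BOTH the last edit and the last fire are over 12 months old."""
    edit_old = today - alert.last_edit > STALE_AFTER
    # Assumption: an alert that has never fired counts as old on the fire side.
    fire_old = alert.last_fire is None or today - alert.last_fire > STALE_AFTER
    return edit_old and fire_old


def next_action(alert: Alert, today: date) -> str:
    """Return 'keep', 'remind', or 'archive' for a single alert."""
    if not is_stale(alert, today):
        return "keep"
    if alert.ignored_reminders >= MAX_IGNORED_REMINDERS:
        return "archive"
    return "remind"
```

Keeping the decision in one pure function makes the policy easy to review and test. Sending the email and incrementing the counter would live in whatever scheduler runs the quarterly job.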

When recency lies

Recent fires don’t mean health. An alert that fired last week is not necessarily healthy: it might be flapping on a deprecated dependency. Cross-check whether the runbook still resolves, the linked dashboard loads, and the owning team still exists. Add an annual deep review on top of the staleness flag: the flag catches the obvious, and the review catches the subtle.
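The three cross-checks can be sketched as one review function. Everything named here is hypothetical: the AlertLinks record, the url_resolves and team_exists callables, and the example URLs. The checks are passed in as functions because how you probe a URL or look up a team depends on your tooling. In practice url_resolves might wrap an HTTP request and team_exists might query an org directory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertLinks:
    """Hypothetical view of the things an alert points at."""
    name: str
    runbook_url: str
    dashboard_url: str
    owner_team: str


def review_alert(
    alert: AlertLinks,
    url_resolves: Callable[[str], bool],
    team_exists: Callable[[str], bool],
) -> List[str]:
    """Return the list of problems found; an empty list means the checks passed."""
    problems: List[str] = []
    if not url_resolves(alert.runbook_url):
        problems.append("runbook does not resolve")
    if not url_resolves(alert.dashboard_url):
        problems.append("dashboard does not load")
    if not team_exists(alert.owner_team):
        problems.append("owning team no longer exists")
    return problems


# Usage with stubbed checks (illustrative data only).
live_urls = {"https://wiki.example/runbooks/disk"}
teams = {"storage"}
alert = AlertLinks(
    name="disk_full",
    runbook_url="https://wiki.example/runbooks/disk",
    dashboard_url="https://dash.example/d/old-board",
    owner_team="platform-legacy",
)
issues = review_alert(alert, live_urls.__contains__, teams.__contains__)
```

An alert that fired last week can still return a non-empty list here, which is the case the staleness flag alone would miss.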

Auto-archive process

Archive is not delete. The alert config moves to an /archive folder in git with a tombstone reason. Bringing it back takes one PR if the alert is needed again, so archiving is cheap to do and cheap to undo. After 24 months in the archive, hard-delete the config, because anything that was needed would have been restored by then.
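A rough sketch of the archive and purge steps follows. It works on plain files only; committing the move and opening the PR are left to your git workflow. The tombstone file format (a JSON file next to the archived config), the function names, and the 730-day approximation of "24 months" are all assumptions made for illustration.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import List

HARD_DELETE_AFTER = timedelta(days=730)  # "24 months", approximated as 730 days
TOMBSTONE_SUFFIX = ".tombstone.json"     # hypothetical naming convention


def archive_alert(config: Path, archive_dir: Path, reason: str, today: date) -> Path:
    """Move a config into the archive folder and write a tombstone beside it."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / config.name
    config.rename(target)
    tombstone = archive_dir / (config.name + TOMBSTONE_SUFFIX)
    tombstone.write_text(
        json.dumps({"reason": reason, "archived_on": today.isoformat()})
    )
    return target


def purge_expired(archive_dir: Path, today: date) -> List[str]:
    """Hard-delete configs archived more than 24 months ago; return their names."""
    deleted: List[str] = []
    for tombstone in sorted(archive_dir.glob("*" + TOMBSTONE_SUFFIX)):
        meta = json.loads(tombstone.read_text())
        archived_on = date.fromisoformat(meta["archived_on"])
        if today - archived_on > HARD_DELETE_AFTER:
            config_name = tombstone.name[: -len(TOMBSTONE_SUFFIX)]
            (archive_dir / config_name).unlink(missing_ok=True)
            tombstone.unlink()
            deleted.append(config_name)
    return deleted
```

Restoring an alert is the reverse move: a PR that moves the config back out of the archive folder and drops the tombstone. Storing the archive date in the tombstone means the purge does not have to trust file modification times.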

How to roll out decay

The rollout is concrete. Add last-edit and last-fire labels in your alerting tool (Datadog, PagerDuty, and Prometheus each support some form of custom labels or tags). Build the staleness report once and schedule it weekly into a Slack channel. Resist the urge to argue per alert: the point is to default to retirement, and exceptions need owners willing to defend them.
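As one possible shape for the weekly report, the sketch below turns a list of stale alerts into a plain-text message grouped by owning team. The function name, the input format (team and alert name pairs), and the message wording are hypothetical. Posting the text to Slack and scheduling the job are left to whatever tooling you already use.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_staleness_report(stale: Iterable[Tuple[str, str]]) -> str:
    """Build a text report from (owner_team, alert_name) pairs."""
    by_team: Dict[str, List[str]] = defaultdict(list)
    for team, alert_name in stale:
        by_team[team].append(alert_name)

    if not by_team:
        return "No stale alerts this week."

    lines = ["Stale alerts (retired by default unless an owner objects):"]
    for team in sorted(by_team):
        names = ", ".join(sorted(by_team[team]))
        lines.append(f"- {team}: {names}")
    return "\n".join(lines)


# Usage with illustrative data only.
report = build_staleness_report(
    [("storage", "disk_full"), ("web", "latency_p99"), ("storage", "inode_low")]
)
```

Grouping by team keeps the message short and puts the burden of defending an exception on a named owner rather than on the channel as a whole.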