The Alert Decay Pattern
Old alerts stop mattering. Auto-decay them.
What alert decay means
Alerts have a half-life. The system they monitor changes faster than the alert config. After 6-12 months, most alerts no longer reflect reality.
Decay shows up as: alerts firing on retired metrics, alerts pointing to teams that no longer own the service, runbooks that 404.
Treat alert age like dependency age. Old is suspicious; old and untouched is broken.
Auto-flag stale alerts
Tag every alert with last-edit date and last-fire date. If both are over 12 months old, auto-tag as stale.
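A minimal sketch of that staleness check in Python, assuming each alert record carries last-edit and last-fire timestamps; the field names and the 12-month window are illustrative, not any tool's API:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=365)  # the 12-month window from the rule above

def is_stale(last_edit: datetime, last_fire: datetime | None,
             now: datetime | None = None) -> bool:
    """Stale only if BOTH the config edit and the firing history are old.
    An alert that has never fired counts as old on the firing side."""
    now = now or datetime.now(timezone.utc)
    edited_recently = (now - last_edit) < STALE_AFTER
    fired_recently = last_fire is not None and (now - last_fire) < STALE_AFTER
    return not (edited_recently or fired_recently)
```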
Stale alerts get a quarterly email to the owning team. Three quarters without action and the alert is auto-archived.
This forces the team to either confirm the alert is still wanted or accept its retirement.
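One way to track that escalation, sketched as a per-alert state record that counts quarters flagged without an owner response; the names and structure are assumptions, not a specific tool's feature:

```python
from dataclasses import dataclass

@dataclass
class StaleState:
    alert_id: str
    quarters_flagged: int = 0   # consecutive quarters the alert stayed stale
    acknowledged: bool = False  # owner confirmed the alert is still wanted

def quarterly_tick(state: StaleState, still_stale: bool) -> str:
    """Run once per quarter per stale alert; returns the action to take."""
    if not still_stale or state.acknowledged:
        state.quarters_flagged = 0
        state.acknowledged = False
        return "keep"
    state.quarters_flagged += 1
    if state.quarters_flagged >= 3:
        return "archive"        # three quarters without action: auto-archive
    return "email_owning_team"  # otherwise, send the quarterly nag
```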
When recency lies
An alert that fired last week is not necessarily healthy. It might be flapping on a deprecated dependency.
Cross-check: does the runbook still resolve? Does the linked dashboard load? Does the owning team still exist?
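A sketch of those cross-checks using plain HTTP probes; the alert fields (runbook_url, dashboard_url, owner) and the set of known teams are assumptions about your setup:

```python
import requests

def url_resolves(url: str) -> bool:
    """HEAD the URL and treat any 2xx/3xx as 'still resolves'."""
    try:
        return requests.head(url, allow_redirects=True, timeout=5).status_code < 400
    except requests.RequestException:
        return False

def cross_check(alert: dict, known_teams: set[str]) -> list[str]:
    """Return the problems a recently-fired alert can still have."""
    problems = []
    if not url_resolves(alert["runbook_url"]):
        problems.append("runbook does not resolve")
    if not url_resolves(alert["dashboard_url"]):
        problems.append("dashboard does not load")
    if alert["owner"] not in known_teams:
        problems.append("owning team no longer exists")
    return problems
```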
Add an annual deep review on top of the staleness flag. The flag catches the obvious; the review catches the subtle.
Auto-archive process
Archive is not delete. The alert config moves to an /archive folder in git, with a tombstone note recording why it was retired.
Bring-back is one PR if the alert is needed again. Cheap to undo, cheap to do.
After 24 months in archive, hard-delete. Anything that was needed would have been restored by then.
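A sketch of the archive step as a script run inside the alert-config repo; the /archive layout and tombstone format follow the description above, and the file naming is illustrative:

```python
import subprocess
from datetime import date
from pathlib import Path

def archive_alert(config_path: Path, reason: str, repo_root: Path) -> None:
    """Move an alert config into /archive with a tombstone explaining why."""
    archive_dir = repo_root / "archive"
    archive_dir.mkdir(exist_ok=True)
    dest = archive_dir / config_path.name

    # git mv keeps history, which is what makes the bring-back PR one step
    subprocess.run(["git", "mv", str(config_path), str(dest)],
                   cwd=repo_root, check=True)

    tombstone = dest.with_suffix(dest.suffix + ".tombstone")
    tombstone.write_text(f"archived: {date.today().isoformat()}\nreason: {reason}\n")
    subprocess.run(["git", "add", str(tombstone)], cwd=repo_root, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"Archive alert {config_path.name}: {reason}"],
        cwd=repo_root, check=True,
    )
```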
How to roll out decay
Add last-edit and last-fire labels to your alerting tool. Datadog, PagerDuty, and Prometheus all support custom labels.
Build the staleness report once, then schedule it to post weekly to a Slack channel.
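A sketch of that weekly post, assuming a Slack incoming webhook and a list of stale alerts produced by the staleness check above; run it from cron or CI:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # assumed incoming webhook

def post_staleness_report(stale_alerts: list[dict]) -> None:
    """Summarise stale alerts into one Slack message per week."""
    if not stale_alerts:
        return
    lines = [f"{len(stale_alerts)} stale alerts (no edit or fire in 12 months):"]
    lines += [f"- {a['name']} (owner: {a['owner']})" for a in stale_alerts]
    requests.post(SLACK_WEBHOOK_URL, json={"text": "\n".join(lines)}, timeout=10)
```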
Resist the urge to argue per-alert. The point is to default to deletion; exceptions need owners willing to defend them.