Why Your Alert Should Have a Runbook (and a Test for It)
A runbook only helps if on-call can find it and trust it. Both are achievable; both rot without discipline.
Why runbooks rot
Runbooks are written once, and then the system changes around them. A runbook ends up referencing a service that has been renamed, a script that has been removed, or a Slack channel that has been archived.
Within six months most runbooks are partly wrong, and on-call learns to mistrust them.
Link from the alert itself
- Every alert should carry its runbook URL as an annotation. Alertmanager passes annotations through to its notifications, and PagerDuty incidents can carry links.
- One click from the page to the runbook removes the ‘where is the runbook?’ hunt that eats the first 5 minutes of every incident.
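One way to keep the link from going missing is to lint the alert rules themselves. The sketch below assumes rules in the Prometheus rule-file shape (groups containing rules), already parsed into a dict, and assumes the annotation is named `runbook_url`. That key is a common convention rather than a requirement, and the alert names and URL are hypothetical.

```python
from urllib.parse import urlparse


def alerts_missing_runbook(rule_file: dict, key: str = "runbook_url") -> list[str]:
    """Return the names of alerting rules that lack a usable runbook URL annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules never page anyone, so skip them
            url = str((rule.get("annotations") or {}).get(key) or "")
            parsed = urlparse(url)
            # Require an absolute http(s) URL; empty or relative values count as missing.
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                missing.append(rule["alert"])
    return missing


# Hypothetical rule file, shaped like what a YAML parser would return.
rules = {
    "groups": [{
        "name": "api",
        "rules": [
            {"alert": "HighErrorRate", "expr": "job:errors:rate5m > 0.05",
             "annotations": {"runbook_url": "https://runbooks.example.com/api/high-error-rate"}},
            {"alert": "DiskAlmostFull", "expr": "disk_free_ratio < 0.1",
             "annotations": {"summary": "disk filling"}},
            {"record": "job:errors:rate5m", "expr": "sum(rate(errors_total[5m]))"},
        ],
    }]
}

print(alerts_missing_runbook(rules))  # ['DiskAlmostFull']
```

Run against every rule file in the repository, a non-empty result is a reason to fail the build.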
The runbook test in CI
Add a CI test that walks each runbook and checks three things: every command it names exists, every link resolves, and every Slack channel it mentions is still live.
Runbook drift then fails CI, and the engineer whose change broke the runbook fixes it before merge.
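A minimal sketch of such a check follows. It rests on several assumptions that are not part of the approach itself: runbooks are plain text, command lines start with `$ `, Slack channels are written as `#name`, and the set of live channels is supplied by the caller (how you export it from Slack is up to you). The function names and the sample runbook are hypothetical.

```python
import re
import shutil
import urllib.request
from typing import Callable

LINK_RE = re.compile(r"https?://[^\s)>\]]+")
# '#name' not preceded by a word character or '/', so URL fragments are ignored.
CHANNEL_RE = re.compile(r"(?<![\w/])#([a-z0-9][a-z0-9_-]*)")
# Assumed convention: a command is a line that starts with '$ '.
COMMAND_RE = re.compile(r"^\$ (\S+)", re.MULTILINE)


def link_resolves(url: str, timeout: float = 5.0) -> bool:
    """Best-effort HEAD request; any network or HTTP error counts as a dead link."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, ValueError):
        return False


def check_runbook(
    text: str,
    known_channels: set[str],
    command_exists: Callable[[str], object] = shutil.which,
    link_ok: Callable[[str], bool] = link_resolves,
) -> list[str]:
    """Return a list of human-readable problems found in one runbook."""
    problems = []
    for cmd in COMMAND_RE.findall(text):
        if not command_exists(cmd):
            problems.append(f"command not found: {cmd}")
    for url in LINK_RE.findall(text):
        url = url.rstrip(".,;")  # drop sentence punctuation glued to the URL
        if not link_ok(url):
            problems.append(f"dead link: {url}")
    for channel in CHANNEL_RE.findall(text):
        if channel not in known_channels:
            problems.append(f"unknown Slack channel: #{channel}")
    return problems


# Hypothetical runbook; the checks are stubbed so the example needs no network.
runbook = """\
Restart the worker:
$ systemctl restart worker
$ drain-queue --all
Dashboard: https://grafana.example.com/d/worker
Escalate in #payments-oncall, not #payments-old.
"""

problems = check_runbook(
    runbook,
    known_channels={"payments-oncall"},
    command_exists=lambda cmd: cmd in {"systemctl"},
    link_ok=lambda url: True,
)
for problem in problems:
    print(problem)
```

In a real pipeline the job would exit non-zero whenever `problems` is non-empty. Passing the checks in as parameters keeps the logic testable without a network or a populated `PATH`.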
Quarterly verification
Once a quarter, pick three runbooks at random and have the on-call rotation walk through them in a tabletop exercise. Anything that breaks gets a real fix.
This catches drift the CI test misses, e.g. a runbook that still references the right thing when the thing itself no longer behaves the same way.
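The random pick is easy to automate. The sketch below is one possible approach: it seeds the draw with a quarter label (a hypothetical format such as `2025-Q3`) so the selection is reproducible and nobody can quietly reroll away from an awkward runbook. The runbook names are made up.

```python
import random


def pick_for_tabletop(runbooks: list[str], quarter: str, k: int = 3) -> list[str]:
    """Pick k runbooks for this quarter's tabletop; same quarter, same pick."""
    rng = random.Random(quarter)  # seeding by the quarter label makes the draw repeatable
    # Sort first so the result does not depend on the order the names were listed in.
    return rng.sample(sorted(runbooks), min(k, len(runbooks)))


runbooks = [
    "api-high-error-rate",
    "db-replication-lag",
    "queue-backlog",
    "cert-expiry",
    "disk-almost-full",
]
print(pick_for_tabletop(runbooks, "2025-Q3"))
```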
Antipatterns
- Runbook in a wiki. Search is poor, so on-call defaults to asking in Slack.
- No expiry on runbooks. They never get reviewed.
- One mega-runbook. Nobody finds the right section inside it at 3am.
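The missing-expiry antipattern above has a cheap counter: record a review date per runbook and flag anything older than a cutoff. The sketch below assumes a 90-day limit and a simple name-to-date mapping. Both are illustrative choices, and where the dates live (front matter, a sidecar file, git history) is left open.

```python
from datetime import date, timedelta


def expired_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 90
) -> list[str]:
    """Return runbooks whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)


# Hypothetical review dates.
reviews = {
    "api-high-error-rate": date(2025, 1, 10),
    "db-replication-lag": date(2024, 6, 2),
}
print(expired_runbooks(reviews, today=date(2025, 3, 1)))  # ['db-replication-lag']
```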
What to do this week
Three moves. (1) Give your noisiest alert a linked, CI-tested runbook. (2) Measure pages-per-shift for one week before and one week after. (3) Schedule the quarterly review now, so the discipline survives team turnover.