Alert Management Intermediate By Samson Tanimawo, PhD Published Dec 5, 2026 8 min read

Why Your Alert Should Have a Runbook (and a Test for It)

A runbook only helps if on-call can find it and trust it. Both are achievable; both rot without discipline.

Why runbooks rot

Runbooks are written once, then the system changes around them. A runbook ends up referencing a service that has been renamed, a script that has been removed, or a Slack channel that has been archived.

Within six months or so, most runbooks are at least partly wrong, and on-call learns to mistrust them.

Link from the alert itself

Every alert should carry the runbook URL in its annotations. Alertmanager and PagerDuty both support this.
One click from the page to the runbook eliminates the ‘where is the runbook?’ scramble that eats the first five minutes of an incident.
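One way to keep that link from quietly disappearing is to lint the alert rules themselves. The sketch below assumes the rules have already been parsed into dictionaries shaped like a Prometheus rules file (groups, then rules, then annotations) and that the annotation key is `runbook_url`, a common convention rather than a requirement. The alert names and URLs are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical rule data, shaped like a parsed Prometheus rules file
# (groups -> rules -> annotations). Names and URLs are illustrative only.
RULE_GROUPS = [
    {
        "name": "api",
        "rules": [
            {
                "alert": "ApiHighErrorRate",
                "annotations": {
                    "summary": "API 5xx rate above threshold",
                    "runbook_url": "https://runbooks.example.com/api-high-error-rate",
                },
            },
            {
                "alert": "ApiHighLatency",
                "annotations": {"summary": "API p99 latency above threshold"},
            },
        ],
    }
]


def alerts_missing_runbook(groups, key="runbook_url"):
    """Return names of alerting rules without a usable http(s) runbook URL."""
    missing = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules do not page anyone
            url = (rule.get("annotations") or {}).get(key, "")
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    print(alerts_missing_runbook(RULE_GROUPS))
```

Run against the sample data, this reports `ApiHighLatency`, the one alert that pages without telling on-call where to look.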

The runbook test in CI

Add a CI test that walks each runbook and checks that every command it names exists, every link resolves, and every Slack channel it mentions is still live.

Runbook drift then fails CI, and the engineer whose change broke the runbook fixes it before merge.
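A minimal sketch of such a check follows, under several assumptions that are not from the article: runnable commands appear on lines starting with `$ `, Slack channels are written as `#channel-name`, and the set of active channel names is supplied by the caller (how you export it from Slack is left open). The function names are hypothetical.

```python
import http.client
import re
import shutil
import urllib.request

LINK_RE = re.compile(r"https?://[^\s)>\]\"']+")
# Assumed convention: runnable commands sit on lines that start with "$ ".
COMMAND_RE = re.compile(r"^\$ +(\S+)", re.MULTILINE)
# "#name" not preceded by a word character or "/", so URL fragments are skipped.
CHANNEL_RE = re.compile(r"(?<![\w/])#([a-z0-9][a-z0-9_-]*)")


def extract_links(text):
    """Return unique URLs, with trailing sentence punctuation stripped."""
    return sorted({url.rstrip(".,;:") for url in LINK_RE.findall(text)})


def extract_commands(text):
    """Return the unique executable names from '$ ' command lines."""
    return sorted(set(COMMAND_RE.findall(text)))


def extract_channels(text):
    """Return unique Slack channel names, without the leading '#'."""
    return sorted(set(CHANNEL_RE.findall(text)))


def link_resolves(url, timeout=5.0):
    """Return True if a HEAD request succeeds with a status below 400."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        return False


def check_runbook(text, known_channels, link_ok=link_resolves):
    """Return a list of human-readable problems found in one runbook."""
    problems = []
    for command in extract_commands(text):
        if shutil.which(command) is None:  # not on PATH in the CI image
            problems.append(f"command not found: {command}")
    for url in extract_links(text):
        if not link_ok(url):
            problems.append(f"link does not resolve: {url}")
    for channel in extract_channels(text):
        if channel not in known_channels:
            problems.append(f"unknown Slack channel: #{channel}")
    return problems
```

In CI, a small wrapper would call `check_runbook` on each runbook file and exit non-zero if any list comes back non-empty. Passing `link_ok` as a parameter keeps the check testable without network access. Note that `shutil.which` only proves the command is installed where CI runs, which may differ from what on-call has on a laptop or bastion host.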

Quarterly verification

Each quarter, pick three runbooks at random and have the on-call rotation walk through them in a tabletop exercise. Anything that breaks gets a real fix.

This catches drift the CI test misses, for example a runbook that still references the right component when that component no longer behaves the same way.
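The random pick can be scripted so nobody is tempted to choose the easy runbooks. The sketch below seeds the draw with the quarter label, an assumption made here so that everyone who runs it in the same quarter gets the same three. The function names and the example paths are hypothetical.

```python
import random
from datetime import date


def quarter_label(day):
    """Return a label such as '2026-Q4' for the given date."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def pick_runbooks(paths, day, count=3):
    """Pick `count` runbooks at random, reproducibly within a quarter."""
    rng = random.Random(quarter_label(day))  # seed with the quarter label
    pool = sorted(paths)  # sort so input order does not affect the draw
    return rng.sample(pool, min(count, len(pool)))


if __name__ == "__main__":
    runbooks = [
        "runbooks/api-high-error-rate.md",
        "runbooks/queue-backlog.md",
        "runbooks/disk-full.md",
        "runbooks/cert-expiry.md",
        "runbooks/db-replica-lag.md",
    ]
    for path in pick_runbooks(runbooks, date.today()):
        print(path)
```

If the team would rather not know the picks in advance, drop the seed and use `random.sample` directly on the day of the tabletop.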

Antipatterns

Runbook in a wiki. Search is poor; on-call defaults to asking in Slack.
No expiry on runbooks. They never get reviewed.
One mega-runbook. Search inside fails at 3am.
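The missing-expiry antipattern in particular lends itself to automation. The sketch below assumes a convention that is not from the article: each runbook carries a line such as `Last-Reviewed: 2026-09-01`, and anything older than 90 days (roughly one quarter) counts as expired. Both the header name and the threshold are placeholders to adapt.

```python
import re
from datetime import date, timedelta

# Assumed convention (not a standard): a "Last-Reviewed: YYYY-MM-DD" line
# somewhere in the runbook, on a line of its own.
REVIEWED_RE = re.compile(
    r"^Last-Reviewed:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)


def review_status(text, today, max_age_days=90):
    """Return 'ok', 'expired', or 'missing' for one runbook's review date."""
    match = REVIEWED_RE.search(text)
    if match is None:
        return "missing"
    try:
        reviewed = date.fromisoformat(match.group(1))
    except ValueError:
        return "missing"  # looks like a date but is not a valid one
    if today - reviewed > timedelta(days=max_age_days):
        return "expired"
    return "ok"
```

Treating "missing" and "expired" as CI warnings at first, then as failures once the backlog is cleared, is one way to introduce the check without blocking every merge on day one.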

What to do this week

Three moves. (1) Apply this pattern to your noisiest alert. (2) Measure pages-per-shift before/after for one week. (3) Schedule the quarterly review so the discipline survives team turnover.