Alert-Driven Runbook Updates
Each unhandled alert reveals a runbook gap. Track them.
The pattern
Every alert that fires without a clear runbook step is a runbook gap. Track them as work items, not as moments of frustration; most teams discover gaps mid-incident then forget to fix them once the page closes, so the fix only happens when the gap lives in a backlog. Treat runbook coverage as a service-level metric, not a documentation chore.
- Alert-without-runbook = gap. Track as work items, not frustration moments.
- Mid-incident discovery decays. Teams forget once the page closes; backlog captures the work.
- Service-level metric. Runbook coverage is operational discipline, not a documentation chore.
- Per-team coverage tracking. The metric drives the work; supports continued investment.
How to track
Three mechanisms make runbook gaps visible. Add a “runbook clear?” field to the post-incident review template (yes, partial, no; roll up monthly per service); on-call posts a one-line note in the incident channel (“runbook covered this” or “runbook missed step X”) captured in the timeline; open a JIRA ticket per gap with 2-week SLA on closing.
- Post-incident review field. Yes, partial, no; rolled up monthly per service.
- One-line in-channel note. “Runbook covered this” or “missed step X”; captured in timeline.
- JIRA ticket per gap. Assigned to owning team; 2-week SLA on closing.
- Per-service rollup monthly. Coverage trend visible at the team level; supports continued attention.
What a good runbook looks like
A good runbook has three sections. Detect: how do I confirm this is real? Mitigate: what stops the bleeding now? Fix: what addresses the root cause later? Use concrete commands, not concepts (“run kubectl rollout undo deployment/checkout” beats “roll back the last deploy”); link to dashboards not just metric names.
- Detect-mitigate-fix. Three sections; the on-call walks through them in order.
- Concrete commands. kubectl rollout undo beats “roll back the last deploy”; copy-paste ready.
- Dashboard URLs not metric names. The on-call needs the URL, not the lookup.
- Per-runbook standard structure. The three sections form a template; supports consistent coverage.
Runbook rot
Runbooks rot in 6 months. Commands change, dashboard URLs break, owners leave; run a quarterly drill where you pick 5 random runbooks and have a non-author execute the first 3 steps (anything that fails gets re-written); tie runbook freshness to service tier (tier 1 services get quarterly drills, tier 3 can wait 6 months).
- 6-month half-life. Commands change, URLs break, owners leave; freshness is permanent overhead.
- Quarterly drill. Non-author executes first 3 steps; failures trigger rewrite.
- Tier-based cadence. Tier 1 quarterly, tier 3 every 6 months; the cadence matches stakes.
- Per-drill action capture. Each drill produces named rewrites; supports continued accuracy.
Get started
The starter ramp is concrete. Pull every alert that fired in the last 30 days and tag each with “had runbook”, “runbook stale”, or “no runbook”; open one ticket per gap grouped by service (estimate rarely above 2 hours per runbook); make runbook coverage a service-readiness gate where new services don’t graduate to on-call rotation until coverage is above 80%.
- 30-day alert pull and tag. Had runbook, runbook stale, no runbook; the audit baseline.
- One ticket per gap, grouped by service. Estimate rarely above 2 hours per runbook.
- 80% coverage gate for new services. No graduation to on-call rotation until threshold met.
- Per-quarter coverage trend. Documented per cycle; supports continued investment.