Alert-Driven Runbook Updates

Each unhandled alert reveals a runbook gap. Track those gaps.

The pattern

Every alert that fires without a clear runbook step is a runbook gap. Track gaps as work items, not as moments of frustration. Most teams discover gaps mid-incident and then forget them once the page closes, so a fix only happens when the gap lives in a backlog. Treat runbook coverage as a service-level metric, not a documentation chore.

How to track

Three mechanisms make runbook gaps visible. First, add a “runbook clear?” field to the post-incident review template, with the answers yes, partial, or no, and roll the answers up monthly per service. Second, have the on-call engineer post a one-line note in the incident channel (“runbook covered this” or “runbook missed step X”) so that it is captured in the timeline. Third, open a Jira ticket per gap, with a 2-week SLA on closing it.
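The monthly roll-up of the “runbook clear?” field can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the record layout, the REVIEWS sample data, the “search” service, and the monthly_rollup function are all hypothetical, and a real version would read from wherever your post-incident reviews are stored.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical review records: (service, review date, answer to "runbook clear?").
REVIEWS = [
    ("checkout", date(2024, 3, 4), "yes"),
    ("checkout", date(2024, 3, 18), "no"),
    ("checkout", date(2024, 4, 2), "partial"),
    ("search", date(2024, 3, 9), "partial"),
]

VALID_ANSWERS = {"yes", "partial", "no"}


def monthly_rollup(reviews):
    """Count answers per (service, 'YYYY-MM') bucket."""
    rollup = defaultdict(Counter)
    for service, reviewed_on, answer in reviews:
        if answer not in VALID_ANSWERS:
            raise ValueError(f"unexpected answer: {answer!r}")
        rollup[(service, reviewed_on.strftime("%Y-%m"))][answer] += 1
    return rollup


if __name__ == "__main__":
    for (service, month), counts in sorted(monthly_rollup(REVIEWS).items()):
        print(service, month, dict(counts))
```

Rejecting unknown answers keeps the three-value field honest; free-text answers would make the monthly numbers hard to compare across services.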

What a good runbook looks like

A good runbook has three sections. Detect: how do I confirm this is real? Mitigate: what stops the bleeding now? Fix: what addresses the root cause later? Use concrete commands, not concepts: “run kubectl rollout undo deployment/checkout” beats “roll back the last deploy”. Link to dashboards, not just metric names.
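One way to enforce the three-section shape is a small check that flags runbooks missing a section. The sketch below assumes runbooks are plain text with each section starting on a line of the form “Detect:”, “Mitigate:”, or “Fix:”; that layout, the missing_sections function, and the EXAMPLE text are assumptions for illustration only.

```python
REQUIRED_SECTIONS = ("Detect", "Mitigate", "Fix")


def missing_sections(runbook_text):
    """Return the required sections that have no 'Name:' line in the text."""
    found = set()
    for line in runbook_text.splitlines():
        head, sep, _ = line.partition(":")
        if sep and head.strip() in REQUIRED_SECTIONS:
            found.add(head.strip())
    return [name for name in REQUIRED_SECTIONS if name not in found]


# Hypothetical runbook with no Fix section.
EXAMPLE = """\
Detect: check the checkout error-rate dashboard (link goes here)
Mitigate: kubectl rollout undo deployment/checkout
"""

if __name__ == "__main__":
    print(missing_sections(EXAMPLE))
```

A check like this only confirms that the sections exist. Whether the commands in them still work is a question for a drill, not a linter.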

Runbook rot

Left alone, runbooks rot in about 6 months: commands change, dashboard URLs break, owners leave. Run a quarterly drill: pick 5 random runbooks and have a non-author execute the first 3 steps of each. Anything that fails gets rewritten. Tie runbook freshness to service tier: tier 1 services get quarterly drills, tier 3 can wait 6 months.
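The drill selection and the tier-based freshness rule are simple enough to script. The following is a sketch under stated assumptions: “quarterly” is approximated as 90 days and “6 months” as 180 days, no interval is defined for tier 2 because the text does not give one, and the function names is_due and pick_for_drill are hypothetical.

```python
import random
from datetime import date, timedelta

# Days between drills per service tier. These approximate "quarterly" and
# "6 months"; the text gives no tier 2 interval, so none is set here.
DRILL_INTERVAL_DAYS = {1: 90, 3: 180}


def is_due(tier, last_drilled, today):
    """True when the runbook's last drill is older than its tier allows."""
    return today - last_drilled > timedelta(days=DRILL_INTERVAL_DAYS[tier])


def pick_for_drill(runbooks, count=5, seed=None):
    """Pick up to `count` runbooks uniformly at random, without repeats."""
    rng = random.Random(seed)
    return rng.sample(runbooks, min(count, len(runbooks)))


if __name__ == "__main__":
    names = [f"runbook-{i}" for i in range(8)]  # hypothetical runbook names
    print(pick_for_drill(names))
    print(is_due(1, date(2024, 1, 1), date(2024, 6, 1)))
```

Random selection matters here: letting people choose which runbooks to drill tends to favor the ones they already trust.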

Get started

The starter ramp is concrete. Pull every alert that fired in the last 30 days and tag each with “had runbook”, “runbook stale”, or “no runbook”. Open one ticket per gap, grouped by service; the estimate is rarely above 2 hours per runbook. Then make runbook coverage a service-readiness gate: new services don’t graduate to the on-call rotation until coverage is above 80%.
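Once alerts are tagged, coverage per service and the 80% gate reduce to a small calculation. The sketch below makes two assumptions that the text leaves open: only “had runbook” counts as covered (a stale runbook counts as a gap), and each alert appears once in the input. The sample data, the alert names, and the function names are hypothetical.

```python
from collections import defaultdict

COVERAGE_GATE = 0.80  # from the text: coverage must be above 80%
TAGS = {"had runbook", "runbook stale", "no runbook"}

# Hypothetical tagged alerts from the last 30 days: (service, alert, tag).
TAGGED_ALERTS = [
    ("checkout", "HighErrorRate", "had runbook"),
    ("checkout", "HighLatency", "had runbook"),
    ("checkout", "PodCrashLoop", "had runbook"),
    ("checkout", "QueueBacklog", "had runbook"),
    ("checkout", "DiskPressure", "had runbook"),
    ("checkout", "CertExpiry", "runbook stale"),
    ("search", "HighLatency", "had runbook"),
    ("search", "IndexLag", "no runbook"),
]


def coverage_by_service(tagged):
    """Fraction of each service's alerts tagged 'had runbook'."""
    totals = defaultdict(int)
    covered = defaultdict(int)
    for service, _alert, tag in tagged:
        if tag not in TAGS:
            raise ValueError(f"unexpected tag: {tag!r}")
        totals[service] += 1
        if tag == "had runbook":  # assumption: stale runbooks count as gaps
            covered[service] += 1
    return {service: covered[service] / totals[service] for service in totals}


def passes_gate(coverage):
    """Strictly above the gate, matching 'above 80%'."""
    return coverage > COVERAGE_GATE


if __name__ == "__main__":
    for service, cov in sorted(coverage_by_service(TAGGED_ALERTS).items()):
        print(f"{service}: {cov:.0%} passes={passes_gate(cov)}")
```

In this sample, checkout has 5 of 6 alerts covered and clears the gate, while search has 1 of 2 and does not.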