Alerts Practical By Samson Tanimawo, PhD Published Mar 27, 2026 4 min read

Alert-Driven Runbook Updates

Each unhandled alert reveals a runbook gap. Track them.

The pattern

Every alert that fires without a clear runbook step is a runbook gap. Track them as work items, not as moments of frustration.

Most teams discover gaps mid-incident, then forget to fix them once the page closes. The fix only happens when the gap is in a backlog.

Treat runbook coverage as a service-level metric, not a documentation chore.

Add a "runbook clear?" field to your post-incident review template. Yes, partial, or no. Roll up monthly per service.

On-call posts a one-line note in the incident channel: "runbook covered this" or "runbook missed step X". Capture in the timeline.

Open a JIRA ticket per gap. Assign to the owning team. Set a 2-week SLA on closing it.

Three sections: detect (how do I confirm this is real?), mitigate (what stops the bleeding now?), fix (what addresses the root cause later?).

Concrete commands, not concepts. "Run kubectl rollout undo deployment/checkout" beats "roll back the last deploy".

Link to dashboards, not just metric names. The on-call needs the URL, not the lookup.

Runbooks rot in 6 months. Commands change, dashboard URLs break, owners leave.

Run a quarterly drill: pick 5 random runbooks and have a non-author execute the first 3 steps. Anything that fails gets re-written.

Tie runbook freshness to the service tier. Tier 1 services get quarterly drills; Tier 3 can wait 6 months.

Pull every alert that fired in the last 30 days. Tag each with "had runbook", "runbook stale", or "no runbook".

Open one ticket per gap, group by service. Estimate is rarely above 2 hours per runbook.

Make runbook coverage a service-readiness gate. New services do not graduate to on-call rotation until coverage is above 80%.