Runbook Skeleton
Twelve sections, one page. The structure that survives a 3am page from someone who's never seen this service before. Copy the skeleton, fill in the blanks, save it next to the alert that triggers it.
1–3, Header, trigger, blast radius
The first three sections answer "what am I looking at?" in 30 seconds. If the next on-call has to read past these to know what's happening, the runbook has already failed.
- 1. Title & alert link. "payment-svc, p99 latency > 2s". Link the literal alert that fires this runbook. Prompt: Could a stranger tell which alert this matches?
- 2. Trigger. The exact alert query / condition. PromQL, log filter, whatever. Prompt: Have I copy-pasted from the alert config so they can never drift?
- 3. Blast radius. Who's affected. "Checkout fails for ~2% of customers" is gold; "users may be impacted" is useless. Prompt: What's the customer-visible symptom, in one sentence?
4–5, Verify the alert is real
Half the time the alert is the bug. Verify before you mitigate.
- 4. Verify. The dashboard URL + the specific panel that confirms the customer impact. Not the alert query, the customer-side metric. Prompt: If the alert is wrong, will this dashboard say so?
- 5. Quick sanity checks. Three commands or links. "Is the LB returning 5xx? Are pods running? Did anything deploy in the last hour?" Prompt: Have I included the deploy check? It catches half of all incidents.
6–8, Mitigate, rollback, escalate
The action sections. Mitigate is the "stop the bleeding" step; rollback is the "use this if mitigate doesn't work"; escalate is the "use this if you're stuck after 15 minutes".
- 6. Mitigate. The first remediation, with exact commands. "Scale
payment-svcto 12 replicas:kubectl scale deploy/payment-svc --replicas=12 -n prod". Prompt: Will this command work without me thinking? No placeholders. - 7. Rollback. The "this was a deploy" path. Exact command + how to verify it worked. Prompt: Have I included the verify step? A rollback you can't confirm is worse than no rollback.
- 8. Escalate. Who to page after 15 minutes if not mitigated. Names + handles. Prompt: Is this the human, not just a team? Teams sleep; people answer phones.
9, Comms
Customer-facing communication. The single most-skipped section. Pre-write the language so the on-call doesn't have to wordsmith at 3am.
- 9a. Internal. Channel to post in, severity threshold for paging the comms lead, who chairs the war room.
- 9b. Customer-facing. Pre-written status page draft for this incident class. "We're aware of degraded checkout performance and are investigating." Prompt: Could I post this verbatim with one number changed?
- 9c. Approval. Who has to approve a public statement. With phone number for after-hours.
- If your incident is < Sev-2, comms can usually be a private update. Default the runbook to that and let the on-call escalate.
10, Diagnostics & data to capture
The "before you reboot it" reminder. Once a process is restarted, the live state is gone forever. The runbook reminds the on-call to grab it first.
- Logs,
kubectl logs payment-svc-xxxxx --previous > /tmp/inc-$(date +%s).log. The--previousis critical for crash loops. - Goroutine / thread dumps, whatever your language's equivalent is. kill -3,
jstack,SIGUSR1. - Heap snapshot, if the symptom is memory-related, take it before the OOMKill rotates the pod.
- Metrics window, bookmark or screenshot the dashboard with the incident time range. Kibana / Grafana URL with absolute timestamps.
- Database state, if you're going to truncate or modify, dump the rows first.
SELECT INTO temp_recovery_$ts FROM .... - Prompt: Have I named what gets lost if I just restart it?
11–12, Close-out & learn
- 11. Resolution criteria. "Mitigated when error rate < 0.5% for 10 minutes; resolved when root cause is identified and patched." Prompt: Could two people independently agree this incident is over?
- 12. Post-incident. Where to file the postmortem. Required Sev levels. Owner of the action items. Prompt: Will this trigger automatically, or does the on-call have to remember?
- The runbook itself ends with: "What surprised you in this incident? Add it here." A free-text section that next-time's on-call updates.
- Maintenance, review the runbook every quarter or after every page that uses it. Outdated runbooks are worse than no runbooks.
- Date-stamp every section that has live commands. "Last verified: 2026-04-15." Tells the next reader whether to trust it.
- Keep it under one printed page. If it's longer, you've embedded a tutorial; link to it instead.