Intermediate By Samson Tanimawo, PhD Published Aug 26, 2026 5 min read

Runbook Skeleton

Twelve sections, one page. The structure that survives a 3am page from someone who's never seen this service before. Copy the skeleton, fill in the blanks, save it next to the alert that triggers it.

The first three sections answer "what am I looking at?" in 30 seconds. If the next on-call has to read past these to know what's happening, the runbook has already failed.

4–5, Verify the alert is real

Half the time the alert is the bug. Verify before you mitigate.

6–8, Mitigate, rollback, escalate

The action sections. Mitigate is the "stop the bleeding" step; rollback is the "use this if mitigate doesn't work"; escalate is the "use this if you're stuck after 15 minutes".

9, Comms

Customer-facing communication. The single most-skipped section. Pre-write the language so the on-call doesn't have to wordsmith at 3am.

10, Diagnostics & data to capture

The "before you reboot it" reminder. Once a process is restarted, the live state is gone forever. The runbook reminds the on-call to grab it first.

11–12, Close-out & learn