Best Practices · Beginner · By Samson Tanimawo, PhD · Published Nov 25, 2025 · 5 min read

Runbook Quality: A Grading Rubric You Can Apply Today

Most teams have runbooks. Most runbooks are bad. The problem is not effort; it is the absence of a quality bar. A simple A-D rubric changes the conversation.

Why grading matters

Without a quality bar, "we have a runbook" is a tick-box that hides whether the runbook actually works. Engineers bookmark them, never test them, and discover during the next incident that the runbook references a tool that was deprecated in 2024.

The cost of unmaintained runbooks. The on-call gets paged at 3am, opens the runbook, follows the steps, hits a 404 on a tool URL, then has to figure out the alternative from scratch. The runbook didn't help; it actively wasted 15 minutes of the on-caller's time. Most teams have many runbooks in this state and don't know it.

The grading exercise's value. It surfaces the actual quality. Most teams discover their runbook quality is uneven — a few are great (the ones the team uses regularly), most are stale (last updated months ago), some are catastrophically bad (reference systems that don't exist anymore). The grading shows the team where to invest.

Four criteria

An A runbook scores well on all four; a D runbook fails most of them. Most production runbooks are B-minus to C-plus. Aim for B or better.

The four criteria are independent. A runbook can be perfectly executable but stale; specific but missing the undo path. Each criterion catches a different failure mode; addressing one doesn't fix the others.

The realistic expectation. Most teams have 50-100 runbooks; reaching A on all of them is unrealistic. Aim for: top-10 most-used runbooks at A; everything else at B+. The 80/20 produces most of the value at sustainable cost.

Executability

Can a competent engineer who has never seen this runbook execute it in the middle of the night? If steps are vague ("scale up the service"), the runbook is decoration. If steps are specific commands with expected output, the runbook is real.

The specificity test. "Scale up the service" fails. "Run kubectl scale deployment api --replicas=10 -n prod; verify with kubectl get pods -n prod | grep api that 10 pods are running" passes. The first requires the on-caller to remember the command; the second is copy-pasteable at 3am.

The "expected output" detail. The runbook should tell the engineer what success looks like. Without it, the engineer runs the command and isn't sure whether it worked. With it, the engineer has a verification step embedded in the runbook itself.

Freshness

When was it last validated end to end? Older than 6 months and you should assume it is stale. Older than 12 months and you should assume it does not work. Add a timestamp; treat it as part of the runbook.

The validation discipline. Validation isn't reading the runbook; it's running it. Walk through every step against a real system (preferably staging); confirm each step works. Without execution, "we updated the runbook" doesn't catch broken commands.

The timestamp's role. Visible at the top of the runbook. "Last validated: 2025-09-15." The on-caller sees the date; if it's older than 6 months, they treat each step with extra suspicion. The visible date is a small social pressure that gets runbooks updated.
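A small script can make staleness visible across the whole set. The sketch below assumes runbooks live as markdown files under a runbooks/ directory, each carrying a "Last validated: YYYY-MM-DD" line, and that GNU date and grep are available:

```bash
#!/usr/bin/env bash
# Flag runbooks whose "Last validated:" date is missing or older than ~6 months.
cutoff=$(date -d '-6 months' +%Y-%m-%d)   # GNU date

for f in runbooks/*.md; do
  last=$(grep -oE 'Last validated: [0-9]{4}-[0-9]{2}-[0-9]{2}' "$f" | head -1 | awk '{print $3}')
  if [ -z "$last" ]; then
    echo "MISSING  $f"
  elif [[ "$last" < "$cutoff" ]]; then   # ISO dates compare correctly as strings
    echo "STALE    $f (last validated $last)"
  fi
done
```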

Scope

Does the runbook cover the actual failure mode, or a textbook version that does not occur? Specific runbooks ("the queue depth is over 100k") earn their keep. Generic ones ("if the service is slow") cover everything and help with nothing.

The specificity vs. generality trade-off. Specific runbooks are more useful but require more work to maintain. Generic ones are easier to write but less actionable. The right answer is specific runbooks for the most common 10-15 incidents and a generic "investigation" runbook for everything else.

The discipline of writing specific runbooks. After each incident, ask: "should this be a runbook?" If the incident is likely to recur (based on history or system understanding), yes. The runbook's title is the trigger condition: "Queue depth alert >100k", "Database connection pool exhausted", "Cert expiry warning."
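One way this shows up in practice is the runbook index itself: specific trigger conditions as titles, plus one generic fallback. A hypothetical layout:

```
runbooks/
  queue-depth-alert-over-100k.md
  database-connection-pool-exhausted.md
  cert-expiry-warning.md
  generic-investigation.md        # the fallback for everything else
```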

Undo path

Every step that mutates state needs a stated way to undo it. A runbook that says "increase the connection pool" without saying "to revert: re-deploy with the previous value" sets up the next incident.

The undo path's necessity. The runbook's mitigation might be wrong, or might cause a different problem. The on-caller needs to know how to back out of each step. Without explicit undo paths, mitigations stack and the system reaches a state nobody knows how to revert.

What a good undo path looks like. Specific (commands to run, not "redeploy"). Bounded (says how long the undo takes). Verified (the runbook author actually tested the undo). Without verification, the undo path may not work; teams discover this in production.
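As a sketch, a mitigation step with its undo path spelled out might read like this (the env var name and values are hypothetical; the point is the paired, tested revert):

```bash
# Mitigation: raise the API's DB connection pool (DB_POOL_SIZE is a hypothetical env var)
kubectl set env deployment/api DB_POOL_SIZE=50 -n prod
kubectl rollout status deployment/api -n prod
# Expected: rollout completes in ~2 minutes

# Undo (~2 minutes): restore the previous value and wait for the rollout
kubectl set env deployment/api DB_POOL_SIZE=20 -n prod
kubectl rollout status deployment/api -n prod
```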

How to grade

Spend 30 minutes per runbook. Have a teammate who hasn't seen it walk through the steps aloud, without doing the operation. If they can describe what would happen without ambiguity, A or B. If they end up asking clarifying questions, C. If they ask whether the tool still exists, D.

The teammate-walkthrough method. The grader doesn't execute the runbook; they imagine executing it while reading. Each ambiguity, missing detail, or "I don't know what would happen here" is a deduction. The exercise is fast (30 min) and catches most quality issues.
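A scorecard keeps the walkthrough honest. A minimal sketch with hypothetical deductions; letting the worst criterion drag down the overall grade is one convention, not the only one:

```
Runbook: Queue depth alert >100k          Graded: 2025-11-25   Grader: <teammate>
  Executability  B   (step 4 says "restart the consumer" with no command)
  Freshness      C   (last validated 9 months ago)
  Scope          A   (trigger condition is specific)
  Undo path      D   (no revert documented for the pool-size change)
  Overall        C   (the weakest criterion is what bites at 3am)
```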

The grading session cadence. Quarterly review of top runbooks; annual review of all runbooks. The cadence keeps freshness from drifting and surfaces newly-rotted runbooks before they're needed.

Common antipatterns

"We have runbooks" without quality awareness. Team checks the existence box but never measures usefulness. The grading exercise reveals the gap.

Auto-generated runbooks. Tool generates a runbook from telemetry; engineers never validate. The runbook reads plausibly but doesn't actually work. Auto-generation is a starting point, not the finish.

Runbooks owned by everyone (so by no one). A runbook with no owner doesn't get maintained. Give each runbook a named owner, typically someone on the team that owns the related service.

The "we'll improve runbooks during quiet weeks" plan. Quiet weeks don't exist. Allocate explicit runbook-quality time; otherwise it never happens.

What to do this week

Three moves. (1) List your top-10 most-paged scenarios. Grade their runbooks using the four criteria. (2) For any D-graded runbooks, schedule a rewrite within the next sprint — these are bombs. (3) Add a "last validated" timestamp to every runbook. Without the visible date, freshness drifts invisibly.