Alerts · Beginner · By Samson Tanimawo, PhD · Published Sep 2, 2026 · 7 min read

The Alert-to-Runbook Attachment Pattern

A page without a runbook is a tax on the on-call. Pattern: every alert carries a runbook URL in its annotations, checked at PR time, with CI failing the build when the URL is missing.

Why naked alerts cost so much

An alert without a runbook is a noun without a verb. The on-call gets paged at 3:14am, sees “PaymentLatencyHigh”, and now has to figure out from scratch what that means, how it was wired, and what to do about it. Every minute spent searching is a minute the customer waits.

The cost. We measured it on three teams. Median time-to-action for an alert with a runbook: 4 minutes. Without one: 18 minutes. Across a year of pages, that’s thousands of person-hours of avoidable thrash, and a noticeable chunk of customer-visible downtime.

The hidden cost. Knowledge in the alert author’s head doesn’t outlive the author’s time on the team. The author rotates off; the alert keeps firing; the next on-call is left holding a string with no label. Runbooks are how alerts pay their tax forward.

The annotation-first pattern

The pattern is simple: every alert rule carries a runbook_url annotation that points at a real document. The pager passes it through to the notification. The on-call clicks once and lands on the page.

In Prometheus terms it’s an annotations.runbook_url field. In a CloudWatch alarm it’s a description-with-link. In Nova AI Ops it’s a first-class field on the rule. The exact spelling doesn’t matter; the discipline does.

The rule of thumb. If you can’t name the runbook URL when you’re writing the alert, the alert isn’t ready. The act of writing the URL forces you to write the runbook. Don’t skip it; you’ll never come back.

The naming convention. One runbook per alert, named after the alert (PaymentLatencyHigh lives at /runbooks/payment-latency-high). Not one runbook covering ten alerts; not one alert pointing at ten runbooks. The 1:1 mapping is the only way to make “click the link” reliably useful.
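The convention is mechanical enough to compute. A minimal sketch, assuming alert names are CamelCase and runbooks are served from a single base URL; `RUNBOOK_BASE` and both function names are hypothetical, not part of any tool mentioned here.

```python
import re

# Hypothetical base URL; substitute wherever your runbooks actually live.
RUNBOOK_BASE = "https://runbooks.example.com/runbooks"


def runbook_slug(alert_name: str) -> str:
    """Turn a CamelCase alert name into a kebab-case slug."""
    # Hyphen at each lowercase-or-digit -> uppercase boundary.
    slug = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", alert_name)
    # Hyphen between an acronym and the capitalised word that follows it.
    slug = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "-", slug)
    return slug.lower()


def expected_runbook_url(alert_name: str) -> str:
    """URL the 1:1 convention implies for this alert."""
    return f"{RUNBOOK_BASE}/{runbook_slug(alert_name)}"


print(expected_runbook_url("PaymentLatencyHigh"))
```

With a helper like this, a reviewer (or a lint) can compare the URL in the annotation against the URL the alert name implies, instead of trusting that the author followed the convention.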

Enforce it in CI

Manual discipline doesn’t scale. Pull-request time is when you have leverage; later you don’t. Add a CI check on the alert-rules repo: every rule must have a non-empty runbook_url; the URL must return 200; the document must contain a heading matching the alert name. Three checks, each a few lines of script.

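The lint. A simple YAML walker that opens every *.alerts.yaml and asserts spec.groups[].rules[].annotations.runbook_url is set. (That path is the Prometheus Operator’s PrometheusRule layout; in a plain Prometheus rule file the same fields start at groups[].) New alerts that miss it fail the build; the author writes the runbook to merge.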
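A minimal sketch of the lint, operating on an already-parsed rules document. The function name is hypothetical, and it assumes the file has been loaded into a dict first (for example with PyYAML’s `yaml.safe_load`, a third-party package). It accepts both the `spec.groups` layout and a top-level `groups` key.

```python
from typing import Any


def missing_runbooks(doc: dict[str, Any]) -> list[str]:
    """Return the names of alerting rules that lack a non-empty runbook_url."""
    # PrometheusRule resources nest groups under "spec"; plain rule files do not.
    groups = (doc.get("spec") or doc).get("groups") or []
    missing = []
    for group in groups:
        for rule in group.get("rules") or []:
            if "alert" not in rule:
                continue  # recording rules carry "record" and need no runbook
            url = (rule.get("annotations") or {}).get("runbook_url")
            if not isinstance(url, str) or not url.strip():
                missing.append(rule["alert"])
    return missing


# Example document, shaped like a parsed *.alerts.yaml file.
example = {
    "spec": {
        "groups": [
            {
                "name": "payments",
                "rules": [
                    {
                        "alert": "PaymentLatencyHigh",
                        "annotations": {"runbook_url": "/runbooks/payment-latency-high"},
                    },
                    {"alert": "PaymentErrorRateHigh", "annotations": {}},
                ],
            }
        ]
    }
}
print(missing_runbooks(example))
```

In CI, the wrapper script would run this over every rules file and exit non-zero if any list comes back non-empty.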

The link check. CI requests each runbook URL with curl, following redirects. A 200 passes; a 404 or any other status fails. Catches the rename-without-redirect, the moved-wiki, the typo-in-URL.
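The same check can be sketched in Python rather than curl. This is an illustration under assumptions: the function names are hypothetical, the runbook host is assumed to answer unauthenticated GET requests, and the fetcher is injectable so the logic can be tested without a network.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status for a GET, or 0 if no response arrived."""
    try:
        # urlopen follows redirects, so a renamed page with a redirect still passes.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses land here
    except (urllib.error.URLError, ValueError, OSError):
        return 0  # DNS failure, refused connection, timeout, malformed URL


def broken_links(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> dict[str, int]:
    """Map every URL that did not return 200 to the status that was observed."""
    # dict.fromkeys de-duplicates while keeping the original order.
    statuses = {url: fetch(url) for url in dict.fromkeys(urls)}
    return {url: code for url, code in statuses.items() if code != 200}
```

A CI job would feed this the runbook_url of every rule and fail the build if the returned dict is non-empty, printing each URL next to its status so the author sees what broke.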

The contents check. Optional but valuable: assert the runbook document contains the alert name as a heading. Catches the “wrote one runbook, linked it from ten alerts” antipattern. Forces the 1:1 mapping that makes the click useful.
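A rough sketch of the contents check, assuming runbooks are Markdown and headings use the `#` style. The function name is hypothetical, and the heading detection is deliberately naive: any line starting with `#` counts, including comment lines inside code fences.

```python
def has_alert_heading(runbook_markdown: str, alert_name: str) -> bool:
    """True if some '#'-style heading consists of exactly the alert name."""
    for line in runbook_markdown.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            continue
        # Drop the leading (and any trailing) '#' markers plus surrounding spaces.
        title = stripped.strip("#").strip()
        if title == alert_name:
            return True
    return False


shared_runbook = "# Payments runbook\n\nCovers PaymentLatencyHigh and friends.\n"
dedicated_runbook = "# PaymentLatencyHigh\n\n## What this alert means\n"

print(has_alert_heading(shared_runbook, "PaymentLatencyHigh"))
print(has_alert_heading(dedicated_runbook, "PaymentLatencyHigh"))
```

Requiring an exact heading match, rather than a substring match anywhere in the page, is what lets the check reject the shared runbook that merely mentions the alert in passing.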

What a good runbook looks like

Five sections, no more. (1) What this alert means: the human-language version of the trigger. (2) What the customer sees: carts fail, checkout stalls, the dashboard returns 500s. (3) First check: the one query, dashboard, or log that confirms or denies the alert. (4) Top causes, ranked: the three most likely root causes, with the diagnostic for each. (5) Remediation: the rollback, the restart, the page-the-DBA decision.

The five-section template is enough for 80% of alerts. The ones that need more aren’t runbooks; they’re design docs, and they should live elsewhere with a link from the runbook.
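The template is small enough to stamp out automatically when backfilling. A minimal sketch, assuming runbooks are Markdown files; the function name and the TODO placeholder style are illustrative choices, not an established format.

```python
# The five sections, each with a one-line prompt for the author.
SECTIONS = [
    ("What this alert means", "The human-language version of the trigger."),
    ("What the customer sees", "Cart fails, checkout stalls, dashboard 500s."),
    ("First check", "The one query, dashboard, or log that confirms or denies the alert."),
    ("Top causes, ranked", "The three most likely root causes, with the diagnostic for each."),
    ("Remediation", "The rollback, the restart, the page-the-DBA decision."),
]


def runbook_skeleton(alert_name: str) -> str:
    """Render an empty five-section runbook headed by the alert name."""
    lines = [f"# {alert_name}", ""]
    for title, prompt in SECTIONS:
        lines += [f"## {title}", "", f"TODO: {prompt}", ""]
    return "\n".join(lines)


print(runbook_skeleton("PaymentLatencyHigh"))
```

Heading the file with the alert name keeps the document compatible with a CI contents check that looks for that name as a heading.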

The anti-pattern: the runbook that says “contact the team.” If the runbook is a redirect to Slack, you don’t have a runbook; you have a phone tree. Replace it with the actual first check.

Where AI changes the picture

The runbook used to be a static document. Now it’s an executable. Nova AI Ops parses the runbook on alert fire, runs the diagnostics in parallel, attaches the results to the incident, and proposes the remediation. The on-call lands on the page with the work half-done.

The shape of the change. The runbook still gets written by humans. The execution gets done by an agent. The on-call reads the conclusion, not the script. Time-to-action drops from 4 minutes to under 1; many incidents auto-resolve before the human even acks the page.

The thing not to skip. Even with agents in the loop, write the runbook. The agent reads it as instructions. Without it, the agent has nothing to execute, and you’re back to a human at 3am.

What to do this week

Three moves. (1) Audit your top-10 firing alerts: how many have a runbook URL annotation? Most teams find 30-50%. (2) Add the CI lint to the alerts repo. It’s a ten-line YAML walker, and it pays for itself in two weeks. (3) Backfill the missing runbooks for the top-10. Use the five-section template; one paragraph per section is fine. The compounding return on these three moves is the best ratio you’ll find in alerting.