SLO Breach Runbook Template
What to do when an SLO breach alert fires.
Immediate
When an SLO breach alert fires, the on-call has minutes to figure out what is happening before the budget burns into territory that takes weeks to recover. The runbook's first job is to give the responder a structured place to start so they are not improvising the investigation under pressure.
What the immediate response should cover, in order:
- Triage: what is breaking? Is it availability, latency, error rate, freshness, or something compound? The alert payload should already say. Confirm by looking at the per-dimension dashboard. Don't start investigating before you know which dimension is bad.
- Active or trending? Is the burn rate currently high (active incident, customers being affected right now), or has it been elevated for hours and is now slowly burning the budget down (degraded state)? Active means page the on-call right now; trending means open a ticket and investigate during business hours. A burn-rate sketch follows this list.
- What is the blast radius? Single tenant, single region, single service, or the whole platform? Look at the per-tenant and per-region breakdowns. A single-tenant spike points to that customer's specific behavior or data; a global spike points to a deploy or a shared dependency.
- Within 5 minutes, post status: Even if you don't know the answer yet, post the initial diagnosis to the deploy channel and to the status page if applicable. "We are investigating elevated error rate on service X starting at 14:32 UTC" is enough. Communication latency is what stakeholders judge the response on. A webhook sketch for this step follows at the end of this section.
- Don't fix yet, scope first: The temptation is to start fixing immediately. Scope first, fix second. Knowing the blast radius and the dimension prevents a fix in the wrong place that makes things worse.
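To make the active-versus-trending call concrete, here is a minimal sketch of a two-window burn-rate check in Python. The 5-minute and 60-minute windows, the 14.4 and 6.0 thresholds, and the fetch_burn_rate helper are illustrative assumptions, not part of any particular monitoring stack.

```python
# Minimal sketch: classify a breach as "active" (fast burn, page now) or
# "trending" (slow burn, ticket for business hours) using two burn-rate
# windows. Thresholds and fetch_burn_rate() are assumptions.

def fetch_burn_rate(service: str, window_minutes: int) -> float:
    """Return the error-budget burn rate over the window (1.0 = burning
    exactly at budget; 14.4 = exhausting a 30-day budget in about 2 days).
    Stub: wire this to your metrics backend."""
    raise NotImplementedError

def classify_breach(service: str) -> str:
    short = fetch_burn_rate(service, window_minutes=5)   # catches active incidents
    long = fetch_burn_rate(service, window_minutes=60)   # catches slow burns
    if short >= 14.4 and long >= 14.4:
        return "active: page the on-call now"
    if long >= 6.0:
        return "trending: open a ticket, investigate in business hours"
    return "within budget: no action"
```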
The first 5 minutes set the tone for the rest of the response. A clear structured start beats a fast unstructured one every time.
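For the 5-minute status post, a small sketch of sending the initial message to the deploy channel via an incoming webhook. The DEPLOY_WEBHOOK URL is a placeholder and the message wording mirrors the example above; adapt both to your chat and status tooling.

```python
# Sketch of the "post status within 5 minutes" step: send the initial
# diagnosis to the deploy channel via an incoming webhook.
import json
import urllib.request

DEPLOY_WEBHOOK = "https://hooks.example.com/deploy-channel"  # placeholder URL

def post_initial_status(service: str, dimension: str, onset_utc: str) -> None:
    text = (f"We are investigating elevated {dimension} on {service} "
            f"starting at {onset_utc}. Scope and cause not yet confirmed.")
    req = urllib.request.Request(
        DEPLOY_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# post_initial_status("checkout-api", "error rate", "14:32 UTC")
```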
Triage
Once the immediate scoping is done, the runbook walks through the standard causes in priority order. Most SLO breaches fall into a small number of well-known categories. Checking them systematically gets to a root cause faster than open-ended investigation.
- Recent deploy: The most common cause of SLO breaches is a deploy that landed in the past few hours. Check the deploy log for the affected service. If a deploy correlates with the breach onset, roll it back as the first action and investigate the regression after recovery. The cost of rolling back a clean deploy is small; the cost of investigating around a bad deploy is big. A correlation sketch follows this list.
- Traffic spike: A sudden change in traffic volume or pattern can push a service over its capacity threshold. Check the request rate, the per-tenant rate, and the request mix. A 10x spike from one tenant is a different problem from a global increase, and each calls for a different response.
- Dependency degradation: An upstream service that is slower or less reliable than usual cascades into your own SLO. Check the per-dependency error rate and latency. The dependency's own status page may already be lit up. If so, your job becomes degradation management until the upstream recovers.
- Capacity exhaustion: Look at saturation metrics such as database connections, cache memory, queue depth, file descriptors, and network sockets. Anything at 90%+ utilization is a candidate. The fix is either to scale up immediately or to shed load until capacity recovers. A saturation sketch follows at the end of this section.
- External event: Cloud provider region issue, ISP problem, DNS resolver flake, third-party API outage. Check the cloud provider status page. If it is them, your response is communication and graceful degradation rather than fixing the root cause.
- Standard checklist, applied in order: Walk these in sequence. The cause in roughly 80% of breaches is on this list. The remaining 20% require deeper investigation, but only after the standard causes have been ruled out.
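A minimal sketch of the recent-deploy check referenced above: flag deploys to the affected service that landed shortly before breach onset. The deploy-log shape and the four-hour lookback are assumptions; substitute your deploy tracker's API.

```python
# Sketch of the "recent deploy" check: return deploys to the affected service
# within a lookback window before breach onset, newest first -- these are the
# rollback candidates. The deploy_log format is an illustrative assumption.
from datetime import datetime, timedelta

def correlated_deploys(deploy_log, service, breach_onset, lookback_hours=4):
    window_start = breach_onset - timedelta(hours=lookback_hours)
    hits = [d for d in deploy_log
            if d["service"] == service
            and window_start <= d["deployed_at"] <= breach_onset]
    return sorted(hits, key=lambda d: d["deployed_at"], reverse=True)

# Example: a deploy 40 minutes before onset is the first rollback candidate.
log = [{"service": "checkout-api", "deployed_at": datetime(2024, 5, 1, 13, 52)}]
print(correlated_deploys(log, "checkout-api", datetime(2024, 5, 1, 14, 32)))
```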
The triage section makes incident response repeatable. New on-calls can run the runbook the same way the senior ones do, which is what turns the team's accumulated knowledge into operational discipline.
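A companion sketch for the capacity-exhaustion check: compare saturation metrics against the 90% threshold from the checklist. The metric names and the current_utilization helper are illustrative stubs, not a real metrics API.

```python
# Sketch of the capacity-exhaustion check: anything at or above 90% utilization
# is a scale-up or load-shedding candidate. Metric names are placeholders.

SATURATION_METRICS = [
    "db_connection_pool",
    "cache_memory",
    "queue_depth",
    "file_descriptors",
    "network_sockets",
]

def current_utilization(service: str, metric: str) -> float:
    """Return utilization as a fraction of capacity (0.0-1.0).
    Stub: wire this to your metrics backend."""
    raise NotImplementedError

def saturation_candidates(service: str, threshold: float = 0.90) -> list[str]:
    return [m for m in SATURATION_METRICS
            if current_utilization(service, m) >= threshold]
```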
Escalate
Most breaches resolve at the on-call level. Some do not. The runbook's escalation path tells the on-call when to call for help and who to call. Defining this in advance is what prevents the worst case: an on-call struggling alone with an incident that needed leadership involvement an hour ago.
- Sustained breach, leadership engages: If the budget burn continues for more than 30 minutes despite triage, escalate to the SRE lead. If it continues past 2 hours or threatens the customer-facing SLA, escalate to engineering leadership. Each level has a clear trigger; see the sketch after this list.
- Resource decisions: Some breaches require resources the on-call cannot deploy alone: pulling in another team, breaking the deploy freeze for a hot fix, scaling beyond auto-scaling limits, contacting the cloud provider's enterprise support. Leadership engagement unblocks these.
- Cross-team ask: When the breach is caused by a dependency owned by another team, the runbook tells the on-call who to page. The phone tree is part of the runbook, kept up to date. "Page the database team's on-call directly, do not go through their normal request flow during an incident."
- Customer comms escalation: When the breach affects high-value customers, the runbook routes the comms ask to the right place: customer success leadership, executive sponsor, account management. The on-call should not be drafting a CEO-level customer message at 3 AM.
- Postmortem opens with the escalation: When leadership is engaged, a postmortem ticket is opened immediately. The retro will happen; the ticket existing from minute one prevents it from getting deferred or forgotten in the recovery rush.
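A minimal sketch of the escalation triggers above, mapping sustained burn duration to the level that should be engaged. The 30-minute and 2-hour durations come from the list; the contact labels are placeholders for your paging setup.

```python
# Sketch of the escalation triggers: sustained burn duration -> who to engage.
# Durations mirror the runbook text; labels are placeholders.
from datetime import timedelta

ESCALATION_LEVELS = [
    # (burn sustained for at least, who to engage)
    (timedelta(hours=2),    "engineering leadership"),
    (timedelta(minutes=30), "SRE lead"),
    (timedelta(0),          "on-call"),
]

def escalation_target(burn_duration: timedelta, sla_threatened: bool = False) -> str:
    # A threatened customer-facing SLA escalates to leadership regardless of duration.
    if sla_threatened:
        return "engineering leadership"
    for threshold, target in ESCALATION_LEVELS:
        if burn_duration >= threshold:
            return target
    return "on-call"

# escalation_target(timedelta(minutes=45))  -> "SRE lead"
```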
An SLO breach runbook with structured immediate response, standard triage, and clear escalation is the discipline that turns reliability incidents from chaotic firefighting into routine operational practice. Nova AI Ops generates SLO breach runbook templates per service, surfaces the standard triage checks against live telemetry, and triggers the escalation path automatically when the burn rate crosses defined thresholds.