SLO & Reliability, Practical. By Samson Tanimawo, PhD. Published Jul 11, 2025.

SLO Breach Runbook Template

What to do when an SLO is breached.

Immediate

When an SLO breach alert fires, the on-call responder has only minutes to work out what is happening before the error budget burns down to a level that takes weeks to recover. The runbook's first job is to give the responder a structured place to start, so they are not improvising the investigation under pressure.

What the immediate response should cover, in order:

The first five minutes set the tone for the rest of the response. A clear, structured start beats a fast, unstructured one.
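To make the pace of the burn concrete, here is a minimal sketch of the burn-rate arithmetic a responder might run in those first minutes. It assumes a ratio-based SLO measured over a 30-day window. The function names `burn_rate` and `hours_until_exhausted` are illustrative and not taken from any particular tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    error_ratio: observed fraction of bad events (e.g. 0.0144 for 1.44%).
    slo_target:  the SLO as a fraction (e.g. 0.999 for 99.9%).
    """
    budget = 1.0 - slo_target  # allowed fraction of bad events
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_until_exhausted(rate: float, budget_remaining: float,
                          window_days: int = 30) -> float:
    """Hours until the remaining budget is gone if the current rate holds.

    budget_remaining is the unspent fraction of the window's budget (0.0 to 1.0).
    At a burn rate of 1.0, a full budget lasts exactly one window.
    """
    if rate <= 0:
        return float("inf")  # nothing is burning
    return budget_remaining * window_days * 24 / rate
```

Under these assumptions, a 1.44% error ratio against a 99.9% SLO is a burn rate of roughly 14.4, and a full 30-day budget burning at that rate lasts about 50 hours.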

Triage

Once the immediate scoping is done, the runbook walks through the standard causes in priority order. Most SLO breaches fall into a small number of well-known categories. Checking them systematically gets to a root cause faster than open-ended investigation.
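One way to encode "standard causes in priority order" is as an ordered list of named checks that stops at the first hit. The sketch below is only an illustration: the check names and the stubbed probes are hypothetical, and a real runbook would substitute its own categories and telemetry queries.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (name, probe) pair; the probe returns True when it finds a likely cause.
Check = Tuple[str, Callable[[], bool]]


def run_triage(checks: List[Check]) -> Tuple[Optional[str], List[str]]:
    """Run checks in priority order and stop at the first one that flags a cause.

    Returns the name of the flagged check (or None if nothing was flagged) and
    the names of the checks that were run, so the responder can record what
    has already been ruled out.
    """
    ran: List[str] = []
    for name, probe in checks:
        ran.append(name)
        if probe():
            return name, ran
    return None, ran


# Hypothetical checks with stubbed probes; real probes would query telemetry.
EXAMPLE_CHECKS: List[Check] = [
    ("recent deploy", lambda: False),
    ("dependency errors", lambda: True),
    ("traffic spike", lambda: False),
]
```

Calling `run_triage(EXAMPLE_CHECKS)` walks the list from the top and stops at the second check, because its stub is the first to return `True`. Keeping the order in data rather than in the responder's head is what lets every on-call run the same sequence.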

The triage section makes incident response repeatable. New on-calls can run the runbook the same way the senior ones do, which is what turns the team's accumulated knowledge into operational discipline.

Escalate

Most breaches resolve at the on-call level. Some do not. The runbook's escalation path tells the on-call when to call for help and who to call. Defining this in advance is what prevents the worst case: an on-call struggling alone with an incident that needed leadership involvement an hour ago.
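Defining the escalation rule in advance can be as simple as writing it down as a predicate. The following is a sketch under stated assumptions: the `EscalationPolicy` type, its field names, and the default thresholds (30 minutes, burn rate 14.4) are all hypothetical placeholders that each team would replace with its own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    # Both defaults are hypothetical; each team sets its own thresholds.
    max_minutes_unmitigated: int = 30
    burn_rate_threshold: float = 14.4


def should_escalate(policy: EscalationPolicy, minutes_elapsed: int,
                    mitigated: bool, burn_rate: float) -> bool:
    """Return True when the on-call should call for help.

    Escalate if the incident is still unmitigated and either it has run
    longer than the policy allows or the budget is burning too fast.
    """
    if mitigated:
        return False
    too_long = minutes_elapsed >= policy.max_minutes_unmitigated
    too_fast = burn_rate >= policy.burn_rate_threshold
    return too_long or too_fast
```

With a rule like this, the decision to escalate no longer depends on the responder's judgment under pressure: the on-call checks elapsed time and burn rate against the policy and calls for help as soon as either limit is crossed.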

An SLO breach runbook with structured immediate response, standard triage, and clear escalation is the discipline that turns reliability incidents from chaotic firefighting into routine operational practice. Nova AI Ops generates SLO breach runbook templates per service, surfaces the standard triage checks against live telemetry, and triggers the escalation path automatically when the burn rate crosses defined thresholds.