SLO Launch Checklist

Complete every item below before the SLO is enforced.

Required for launch

An SLO that ships without baseline data, a verified metric source, and stakeholder agreement is a target on paper, not an operational discipline. The discipline is to establish each before the SLO is published so the number is defensible.

Baseline data collected. At least 30 days of metric history before promoting the target to a hard SLO; without data, the target is guesswork.
Metric source verified. SLI metric is correct, freshness is healthy, sampling is unbiased; a bad source produces a meaningless SLO.
Stakeholder agreement. Service team, consumer team, leadership; each agrees on the target and the policy before launch.
Documented launch context. Per-SLO motivation and history captured; supports investigation when the target is challenged later.
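
The baseline item can be checked mechanically. The following is a minimal Python sketch, not a prescribed implementation: DailyCount, baseline_sli, and ready_for_hard_slo are hypothetical names, and the rule that the baseline must already meet the proposed target is an assumption added for illustration. Only the 30-day minimum comes from the checklist.

```python
from dataclasses import dataclass

MIN_BASELINE_DAYS = 30  # minimum history named in the checklist


@dataclass(frozen=True)
class DailyCount:
    """One day of SLI data (hypothetical shape)."""
    good: int   # events that met the SLI criterion
    total: int  # all valid events


def baseline_sli(history: list[DailyCount]) -> float:
    """Event-weighted SLI across the whole baseline window."""
    total = sum(day.total for day in history)
    if total == 0:
        raise ValueError("baseline contains no traffic")
    return sum(day.good for day in history) / total


def ready_for_hard_slo(history: list[DailyCount], target: float) -> bool:
    """True when there is enough history and the baseline meets the target.

    Requiring the baseline to meet the target is an assumption of this
    sketch; a team may instead choose a target and plan work to reach it.
    """
    if len(history) < MIN_BASELINE_DAYS:
        return False
    return baseline_sli(history) >= target
```

With 30 days at 999 good events out of 1000, a 99.5% target passes and a 99.95% target does not; with only 29 days of history, no target passes.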

Monitoring wired

An SLO without alerts and a dashboard is a number nobody watches. The monitoring layer turns the SLO into an operational signal: burn-rate alerts page when the budget is at risk, the dashboard exposes trend, and customer-facing SLOs surface on the status page.

Multi-window burn-rate alerts. 1-hour, 6-hour, 3-day windows; tested by injecting failures in pre-prod.
SLO dashboard published. Visible to stakeholders; auto-refreshes; standard layout for current achievement, trend, recent breaches.
Status page integration. Customer-facing SLOs reported quarterly or monthly; supports trust and external communication.
Pre-prod alert tests. Burn-rate alerts validated by injected failure; the alert path is verified before launch, not during an incident.
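
The multi-window burn-rate check can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the target). The window labels come from the checklist; the thresholds and severities are illustrative assumptions that correspond to spending 2%, 5%, and 10% of a 30-day budget within 1 hour, 6 hours, and 3 days respectively. The function names and the input dictionary are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


# (window, burn-rate threshold, severity). Assumed values for a 30-day SLO:
# 0.02 * 720h / 1h = 14.4, 0.05 * 720h / 6h = 6, 0.10 * 720h / 72h = 1.
WINDOWS = (
    ("1h", 14.4, "page"),
    ("6h", 6.0, "page"),
    ("3d", 1.0, "ticket"),
)


def firing_alerts(error_ratios: dict[str, float], slo_target: float) -> list[tuple[str, str]]:
    """Return (window, severity) for every window whose burn rate meets its threshold."""
    return [
        (window, severity)
        for window, threshold, severity in WINDOWS
        if burn_rate(error_ratios[window], slo_target) >= threshold
    ]
```

A pre-prod test can feed this function the error ratios produced by an injected failure and assert that the expected windows fire, which exercises the same logic the production alert depends on.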

Policy documented

The error budget policy is the contract between the SLO and the team. Without a written policy that names what happens at 50%, 25%, and 0% of budget remaining, the SLO is a metric without consequences; the discipline is to commit the policy to writing before launch.

Error budget policy in writing. What happens at 50%, 25%, and 0% of budget remaining; documented, agreed, visible.
Action items for budget exhaustion. Feature freeze, reliability review, increased on-call attention; specific actions, not generic posture.
Recovery criteria. When the budget is restored, the freeze lifts; the bar for exiting policy actions is documented to avoid arguments later.
Per-SLO policy variation. Critical SLOs have stricter policies than internal SLOs; the variation is documented per-SLO.
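
A written policy can also be expressed as data so the current tier is unambiguous. The sketch below assumes the 50%, 25%, and 0% thresholds refer to budget remaining. The pairing of each action with a threshold is a hypothetical example, not a recommendation: the checklist names the actions but does not say which applies at which level. budget_remaining and policy_action are illustrative names.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad <= 0:
        raise ValueError("slo_target must be below 1.0")
    return 1.0 - (total - good) / allowed_bad


# (remaining-budget threshold, action), checked from most to least severe.
# The action assigned to each tier is an assumption for illustration.
POLICY_TIERS = (
    (0.00, "feature freeze"),
    (0.25, "reliability review"),
    (0.50, "increased on-call attention"),
)


def policy_action(remaining: float) -> str:
    """Return the action for the most severe tier the remaining budget has reached."""
    for threshold, action in POLICY_TIERS:
        if remaining <= threshold:
            return action
    return "no action"
```

For a 99.9% target over 1,000,000 events with 700 failures, about 30% of the budget remains, which lands in the 50% tier. Per-SLO variation could be handled by keeping one tier table per SLO rather than a single shared one.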

Runbook ready

When the burn-rate alert pages, the on-call needs a runbook that names the first place to look and the action to take next. Without a linked runbook, every alert becomes an investigation from scratch, which is the slow path.

First step: where to look. The runbook names the dashboard or query that surfaces current state; the on-call starts from data, not a search.
Second step: what to do. Common remediations documented; the on-call has a triage tree, not a blank page.
Common causes documented. Recent deploys, traffic spikes, dependency issues; the triage tree is part of the runbook.
Escalation tree clear. Senior on-call, service owner, manager; the next call is documented before the page fires.
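
Runbook completeness can be checked before launch with a small lint. This is a sketch under stated assumptions: the dictionary layout and the field names (first_look, remediations, common_causes, escalation) are hypothetical and simply mirror the four items above.

```python
# Hypothetical field names, one per runbook item in this section.
REQUIRED_RUNBOOK_FIELDS = ("first_look", "remediations", "common_causes", "escalation")


def missing_runbook_fields(runbook: dict[str, object]) -> list[str]:
    """Return the required fields that are absent or empty, in checklist order."""
    return [field for field in REQUIRED_RUNBOOK_FIELDS if not runbook.get(field)]
```

An empty result means every section is present; a non-empty result lists what still has to be written before the alert is allowed to page.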

Review cadence committed

An SLO without a recurring review drifts. Monthly performance review, quarterly recalibration, annual deep review of the SLI definition; the cadence keeps the SLO meaningful as the service evolves and customer expectations shift.

Monthly performance review. Trend, burn rate, recent breaches, action items; 30 minutes, same agenda each month.
Quarterly recalibration. Was the target right? Have customer expectations or engineering capacity changed? Adjust with data.
Annual deep review. Is the SLI still measuring the right thing? Has the service evolved? Update the definition if needed.
Per-review action capture. Each review produces named action items with owners; supports follow-through, not just discussion.