SLO Launch Checklist
Complete every item below before the SLO is enforced.
Required for launch
An SLO that ships without baseline data, a verified metric source, and stakeholder agreement is a target on paper, not an operational discipline. The discipline is to establish each before the SLO is published so the number is defensible.
- Baseline data collected. At least 30 days of metric history before promoting to a hard SLO; without data, the target is guesswork.
- Metric source verified. SLI metric is correct, freshness is healthy, sampling is unbiased; a bad source produces a meaningless SLO.
- Stakeholder agreement. Service team, consumer team, leadership; each agrees on the target and the policy before launch.
- Documented launch context. Per-SLO motivation and history captured; supports investigation when the target is challenged later.
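The baseline item above can be checked mechanically. The following is a minimal sketch, assuming the SLI is a ratio of good events to total events and that history is available as one sample per day; `DailySample`, `baseline_report`, and the example numbers are hypothetical names and data introduced for illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MIN_BASELINE_DAYS = 30  # the checklist's minimum history before a hard SLO


@dataclass(frozen=True)
class DailySample:
    """One day of SLI data (hypothetical shape for this sketch)."""
    day: date
    good_events: int
    total_events: int


def baseline_report(samples, proposed_target):
    """Summarise baseline history and compare it with a proposed target."""
    # Count only days that actually carried traffic, so gaps are not hidden.
    days_with_data = {s.day for s in samples if s.total_events > 0}
    good = sum(s.good_events for s in samples)
    total = sum(s.total_events for s in samples)
    achieved = good / total if total else None
    return {
        "days_of_history": len(days_with_data),
        "enough_history": len(days_with_data) >= MIN_BASELINE_DAYS,
        "achieved_sli": achieved,
        "target_met_in_baseline": achieved is not None and achieved >= proposed_target,
    }


# Illustrative data only: 30 days at 9,995 good out of 10,000 requests per day.
start = date(2024, 1, 1)
history = [DailySample(start + timedelta(days=i), 9_995, 10_000) for i in range(30)]
report = baseline_report(history, proposed_target=0.999)
```

A report with `enough_history` false, or with `target_met_in_baseline` false, suggests the proposed target is not yet defensible and belongs in the stakeholder discussion rather than in a published SLO.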
Monitoring wired
An SLO without alerts and a dashboard is a number nobody watches. The monitoring layer turns the SLO into an operational signal: burn-rate alerts page when the budget is at risk, the dashboard exposes trend, and customer-facing SLOs surface on the status page.
- Multi-window burn-rate alerts. 1-hour, 6-hour, 3-day windows; tested by injecting failures in pre-prod.
- SLO dashboard published. Visible to stakeholders; auto-refreshes; standard layout for current achievement, trend, recent breaches.
- Status page integration. Customer-facing SLOs reported monthly or quarterly on the status page; supports trust and external communication.
- Pre-prod alert tests. Burn-rate alerts validated by injected failure; the alert path is verified before launch, not during an incident.
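The burn-rate alerts in this list can be sketched as follows. Burn rate here is the observed error rate divided by the error rate the SLO allows (1 minus the target), so a burn rate of 1 spends the budget exactly over the SLO period. The window names come from the checklist; the per-window thresholds (14.4, 6, 1) are assumptions taken from commonly cited guidance for a 30-day SLO period and should be tuned per service. `Window`, `burn_rate`, and `firing_windows` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    name: str
    hours: float
    burn_threshold: float  # assumed value; tune per service


# Windows from the checklist; thresholds are illustrative assumptions.
WINDOWS = (
    Window("1h", 1, 14.4),
    Window("6h", 6, 6.0),
    Window("3d", 72, 1.0),
)


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate (1 - target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad_events / total_events) / (1.0 - slo_target)


def firing_windows(counts, slo_target):
    """counts maps window name -> (bad, total); returns windows over threshold."""
    firing = []
    for window in WINDOWS:
        bad, total = counts[window.name]
        if burn_rate(bad, total, slo_target) >= window.burn_threshold:
            firing.append(window.name)
    return firing


# Injected-failure style check with made-up counts for a 99.9% target.
injected = {"1h": (20, 1_000), "6h": (30, 6_000), "3d": (50, 72_000)}
firing = firing_windows(injected, slo_target=0.999)
```

With these made-up counts only the 1-hour window crosses its threshold, which is the shape a short injected failure in pre-prod would be expected to produce. A production setup would usually evaluate these conditions in the alerting system itself; the function above only illustrates the arithmetic.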
Policy documented
The error budget policy is the contract between the SLO and the team. Without a written policy that names what happens at 50%, 25%, and 0% budget, the SLO is a metric without consequences; the discipline is to commit the policy to writing before launch.
- Error budget policy in writing. What happens at 50%, 25%, 0%; documented, agreed, visible.
- Action items for budget exhaustion. Feature freeze, reliability review, increased on-call attention; specific actions, not generic posture.
- Recovery criteria. When the budget is restored, the freeze lifts; the bar to exit policy actions is documented to avoid arguments later.
- Per-SLO policy variation. Critical SLOs have stricter policies than internal SLOs; the variation is documented per-SLO.
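A written policy of this kind can also be expressed as data, which makes the thresholds easy to review and test. The sketch below assumes an event-based SLI and uses the 50%, 25%, and 0% tiers named above. The action at 0% reuses the exhaustion actions from the checklist; the actions at 50% and 25% are hypothetical placeholders, as are the names `POLICY_TIERS`, `budget_remaining`, and `policy_action`.

```python
# Each tier is (remaining-budget fraction at or below which it applies, action),
# ordered from strictest to mildest. The 25% and 50% actions are placeholders.
POLICY_TIERS = (
    (0.00, "feature freeze, reliability review, increased on-call attention"),
    (0.25, "hypothetical: prioritise reliability work over new features"),
    (0.50, "hypothetical: review recent burn with the service team"),
)
NORMAL = "normal operations"


def budget_remaining(bad_events, total_events, slo_target):
    """Fraction of the error budget left; negative means the budget is overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        return 1.0  # no traffic yet, so no budget has been spent
    return 1.0 - bad_events / allowed_bad


def policy_action(remaining):
    """Return the action for the strictest tier the remaining budget falls into."""
    for threshold, action in POLICY_TIERS:
        if remaining <= threshold:
            return action
    return NORMAL


# Made-up example: 800 bad events out of 1,000,000 against a 99.9% target
# leaves roughly 20% of the budget, which lands in the 25% tier.
remaining = budget_remaining(800, 1_000_000, 0.999)
action = policy_action(remaining)
```

Per-SLO variation then becomes a matter of giving each SLO its own tier table, with stricter actions for critical SLOs, rather than changing the logic.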
Runbook ready
When the burn-rate alert pages, the on-call needs a runbook that names the first place to look and the next action to take. Without a linked runbook, every alert becomes an investigation from scratch, which is the slow path.
- First step: where to look. The runbook names the dashboard or query that surfaces current state; the on-call starts from data, not a search.
- Second step: what to do. Common remediations documented; the on-call has a triage tree, not a blank page.
- Common causes documented. Recent deploys, traffic spikes, dependency issues; the triage tree is part of the runbook.
- Escalation tree clear. Senior on-call, service owner, manager; the next call is documented before the page fires.
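One way to keep these items from slipping is a pre-launch check that every alert links a runbook with all four parts filled in. The sketch below assumes alert definitions are available as plain dictionaries; the field names (`first_look`, `remediations`, `common_causes`, `escalation`) and the example alerts are hypothetical and would need to match however alerts are actually defined.

```python
# Hypothetical field names mirroring the four runbook items in the checklist.
REQUIRED_RUNBOOK_FIELDS = ("first_look", "remediations", "common_causes", "escalation")


def missing_runbook_fields(alert):
    """Return the required runbook fields that are absent or empty for one alert."""
    runbook = alert.get("runbook") or {}
    return [field for field in REQUIRED_RUNBOOK_FIELDS if not runbook.get(field)]


# Made-up alert definitions for illustration.
alerts = [
    {
        "name": "checkout-availability-burn",
        "runbook": {
            "first_look": "dashboard: checkout availability",
            "remediations": ["roll back the latest deploy"],
            "common_causes": ["recent deploys", "traffic spikes", "dependency issues"],
            "escalation": ["senior on-call", "service owner", "manager"],
        },
    },
    {
        "name": "search-latency-burn",
        "runbook": {"first_look": "dashboard: search latency"},
    },
]

# Keep only the alerts that still have gaps.
gaps = {a["name"]: missing_runbook_fields(a) for a in alerts}
gaps = {name: missing for name, missing in gaps.items() if missing}
```

An empty `gaps` mapping would mean every alert has a complete runbook entry; anything else is a list of what to write before launch.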
Review cadence committed
An SLO without a recurring review drifts. Monthly performance review, quarterly recalibration, annual deep review of the SLI definition; the cadence keeps the SLO meaningful as the service evolves and customer expectations shift.
- Monthly performance review. Trend, burn rate, recent breaches, action items; 30 minutes, same agenda each month.
- Quarterly recalibration. Was the target right? Have customer expectations or engineering capacity changed? Adjust with data.
- Annual deep review. Is the SLI still measuring the right thing? Has the service evolved? Update the definition if needed.
- Per-review action capture. Each review produces named action items with owners; supports follow-through, not just discussion.