SLO Launch Checklist
Complete every item below before the SLO is enforced.
Required for launch
Baseline data collected. At least 30 days of metric history before promoting to a hard SLO. Without data, the target is guesswork.
Metric source verified. Check that the SLI metric is correct, its freshness is healthy, and its sampling is unbiased. A bad metric source produces a meaningless SLO.
Stakeholder agreement. The team owning the service, the team consuming it, and leadership each agree on the target and the policy.
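The baseline and freshness items above can be checked mechanically. The following is a minimal sketch, assuming the SLI history is available as a list of timezone-aware sample timestamps. The function name `launch_data_checks` and the 5-minute staleness limit are hypothetical. Only the 30-day minimum comes from this checklist.

```python
from datetime import datetime, timedelta, timezone


def launch_data_checks(
    timestamps,
    now,
    min_history=timedelta(days=30),       # from the checklist: at least 30 days
    max_staleness=timedelta(minutes=5),   # assumption: tune per metric pipeline
):
    """Return pass/fail flags for baseline coverage and metric freshness."""
    if not timestamps:
        # No samples at all: neither check can pass.
        return {"baseline": False, "fresh": False}
    oldest, newest = min(timestamps), max(timestamps)
    return {
        # History must reach back at least min_history from now.
        "baseline": now - oldest >= min_history,
        # The newest sample must be recent enough to trust the feed.
        "fresh": now - newest <= max_staleness,
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    samples = [now - timedelta(days=31), now - timedelta(minutes=1)]
    print(launch_data_checks(samples, now))
```

This covers only history length and freshness. Whether the metric is correct and the sampling unbiased still needs a manual review.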
Monitoring wired
Multi-window burn rate alerts configured. 1-hour, 6-hour, 3-day windows. Tested by injecting failures in pre-prod.
SLO dashboard published. Visible to stakeholders. Auto-refreshes. Standard layout: current achievement, trend, recent breaches, action items.
Status page integration if customer-facing. Achievement reported publicly, monthly or quarterly.
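The multi-window burn rate check can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The window names match the checklist. The thresholds (14.4, 6, 1) are commonly cited values for a 30-day SLO period and are an assumption here, not something this checklist prescribes. `BurnWindow`, `burn_rate`, and `firing_windows` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    name: str
    hours: float
    threshold: float  # burn rate at or above which the alert fires


# Assumed thresholds for a 30-day SLO period; adjust to your own policy.
WINDOWS = (
    BurnWindow("1h", 1, 14.4),
    BurnWindow("6h", 6, 6.0),
    BurnWindow("3d", 72, 1.0),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio expressed as a multiple of the error budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_windows(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the names of the windows whose burn rate meets their threshold."""
    return [
        w.name
        for w in WINDOWS
        if burn_rate(error_ratios[w.name], slo_target) >= w.threshold
    ]


if __name__ == "__main__":
    # Error ratio measured over each window, e.g. failed / total requests.
    observed = {"1h": 0.02, "6h": 0.005, "3d": 0.0005}
    print(firing_windows(observed, slo_target=0.999))
```

Injecting failures in pre-prod, as the checklist asks, amounts to driving one window's error ratio above its threshold and confirming that the corresponding alert fires.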
Policy documented
Error budget policy in writing. What happens when 50% of the budget remains? 25%? 0%? Documented; agreed; visible.
Action items for budget exhaustion. Feature freeze, reliability review, increased on-call attention. Specific.
Recovery criteria. Once the budget is restored, when do feature freezes lift? Document the bar to avoid arguments later.
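A written policy is easier to apply consistently when it is also encoded. The following is a minimal sketch, assuming an event-based SLI (good and bad events counted over the SLO period). The 50%, 25%, and 0% tiers come from this checklist. The pairing of each tier with an action is hypothetical and stands in for whatever the stakeholders actually agree on.

```python
def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad <= 0:
        raise ValueError("need slo_target < 1.0 and total_events > 0")
    return 1.0 - bad_events / allowed_bad


# (remaining-budget tier, action), ordered from most to least severe.
# The actions are placeholders; replace them with the agreed policy.
POLICY = (
    (0.00, "feature freeze and reliability review"),
    (0.25, "increased on-call attention"),
    (0.50, "notify service owner"),
)


def policy_action(remaining: float) -> str:
    """Return the action for the most severe tier the budget has reached."""
    for tier, action in POLICY:
        if remaining <= tier:
            return action
    return "no action"


if __name__ == "__main__":
    remaining = budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=600)
    print(f"{remaining:.0%} of budget remaining: {policy_action(remaining)}")
```

Recovery criteria can reuse the same function: for example, a freeze could lift once `budget_remaining` climbs back above an agreed tier, though the exact bar is a policy decision.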
Runbook ready
When the SLO is at risk, the on-call has a runbook. First step: where to look. Second: what to do. Linked from the alert.
Common causes documented. Recent deploys, traffic spikes, dependency issues. The triage tree is part of the runbook.
Escalation tree clear. Who to page if the on-call cannot resolve. Senior on-call, service owner, manager. Documented.
Review cadence committed
Monthly review of SLO performance. Trend, burn rate, recent breaches, action items. 30 minutes, same agenda each month.
Quarterly recalibration. Was the target right? Customer expectations? Engineering capacity? Adjust with data.
Annual deep review. Is the SLI still measuring the right thing? Has the service evolved? Update the definition.