What Is an SLO? A Beginner's Guide to Service Level Objectives
SLOs are the contract between your engineering team and the rest of the business. Written right, they shape everything from on-call rotations to roadmap priorities.
Why SLOs exist
An SLO (Service Level Objective) is a specific, measurable target for how well a service performs. It is typically written as a percentage over a window: “99.9% of login requests will return successfully in under 500ms, measured over 28 days.”
Before SLOs, the conversation about reliability is vague: “the site is slow.” After SLOs, it is precise: “login is at 99.5% this week, four tenths below the 99.9% target, and the burn rate says we will blow the monthly budget by Thursday if we don't act.”
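The budget math behind a sentence like “we will blow the monthly budget by Thursday” is simple division. A minimal sketch, with illustrative request counts (none of these numbers come from a real service):

```python
# How much error budget is left, given an SLO target like 0.999.
# All traffic numbers below are made up for illustration.

def error_budget_remaining(target, total_requests, failed_requests):
    """Fraction of the error budget left in the current window."""
    budget = (1 - target) * total_requests  # failures the SLO allows
    return 1 - failed_requests / budget

# 99.9% target, 1,000,000 requests seen so far in the window, 600 failed:
# the SLO allows 1,000 failures, so 60% of the budget is already gone.
remaining = error_budget_remaining(0.999, 1_000_000, 600)
print(f"{remaining:.0%} of the error budget left")  # prints "40% of the error budget left"
```

Projecting that consumption rate forward over the rest of the window is what turns “we are below target” into “we will blow the budget by Thursday.”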
Three ingredients of a good SLO
Every SLO has three components. Getting any one of them wrong makes the whole thing useless.
- A user-facing metric, not an internal one. “CPU under 80%” is not an SLO; users don't care about CPU. “Checkout completes in under 2 seconds” is.
- A target number that is deliberately below 100%. 100% is not a goal; 99.9% says “we accept that 0.1% of requests can fail for reasons outside our control.”
- A rolling window long enough to smooth noise, short enough to act on. 28 days is the sweet spot.
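The three ingredients fit in a record you can write down and review. A sketch with field names chosen for the example (not from any particular SLO tool):

```python
# The three ingredients of an SLO as a plain record.
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str          # the user-facing metric, spelled out
    target: float     # deliberately below 1.0
    window_days: int  # rolling window, long enough to smooth noise

checkout = SLO(
    sli="checkout completes in under 2 seconds",
    target=0.999,
    window_days=28,
)
```

If you can't fill in all three fields for a proposed SLO, one of the ingredients is missing.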
How to set the target number
The most common mistake is picking a round number from a marketing table (“let's do 99.99%, looks aspirational”). The right way is to look at your actual historical performance.
Pull 90 days of data. Find the 95th percentile of the metric. That is roughly where you should set the SLO: ambitious enough to matter, achievable enough that your team isn't perpetually on fire.
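For a latency metric, “find the 95th percentile” is one standard-library call. A sketch with made-up latency samples standing in for 90 days of data:

```python
import statistics

# Stand-in for 90 days of request latencies (made-up numbers).
latencies_ms = [120, 95, 430, 180, 210, 88, 640, 150, 300, 175,
                132, 99, 510, 160, 240, 70, 390, 145, 205, 118]

# statistics.quantiles with n=100 returns the 1st..99th percentiles;
# index 94 is the 95th percentile.
p95 = statistics.quantiles(latencies_ms, n=100)[94]
print(f"Set the latency SLO near {p95:.0f} ms")
```

In production you would pull the samples from your metrics store rather than a list, but the target-setting logic is the same.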
Three mistakes in the first quarter
- Writing SLOs for every endpoint. Pick three user journeys. A team with 40 SLOs has zero SLOs.
- Using raw percentages instead of request counts. “99.9% availability” sounds the same whether you have 100 requests/day or 100 million. The burn-rate math doesn't.
- Not wiring SLOs into alerting. An SLO in a dashboard nobody looks at is a ritual, not a practice.
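The request-count point is worth making concrete: the same 99.9% target is a very different budget in absolute failures depending on traffic. A sketch with invented traffic figures:

```python
# The same availability target, expressed as allowed failures per window.
def allowed_failures(availability_target, requests_per_day, window_days=28):
    """Absolute number of failed requests the SLO permits per window."""
    return (1 - availability_target) * requests_per_day * window_days

small = allowed_failures(0.999, 100)          # ~2.8 failures per 28 days
big   = allowed_failures(0.999, 100_000_000)  # ~2.8 million per 28 days
```

At 100 requests/day, a single bad afternoon blows the budget; at 100 million/day, the budget absorbs whole incidents. That asymmetry is why burn-rate math needs the counts, not just the percentage.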
What to do this week
Pick your most important user journey, the one that, if broken, triggers a Slack fire drill within 10 minutes. Define one SLO for it: a latency target and a success-rate target, both measured over 28 days, set at your historical 95th percentile.
Add two alert rules: a fast-burn alert (2% of monthly budget consumed in an hour) and a slow-burn alert (10% of monthly budget in six hours). Point them at your on-call rotation.
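Those two rules translate into burn-rate multipliers, where a burn rate of 1 means the budget lasts exactly the 28-day window. A sketch of the arithmetic (the thresholds are the ones named above; the function name is ours):

```python
# Convert "X% of budget in Y hours" into a burn-rate multiplier.
def burn_rate_threshold(budget_fraction, alert_window_hours, slo_days=28):
    """Burn rate at which budget_fraction is consumed in alert_window_hours."""
    slo_hours = slo_days * 24  # 672 hours in a 28-day window
    return budget_fraction / (alert_window_hours / slo_hours)

fast = burn_rate_threshold(0.02, 1)   # 2% of budget in 1 hour  -> 13.44x
slow = burn_rate_threshold(0.10, 6)   # 10% of budget in 6 hours -> 11.2x
```

The fast-burn rule catches sharp outages within the hour; the slow-burn rule catches a steady leak that a one-hour window would never notice.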
One SLO, one burn-rate pair, one user journey. Measure for a month, then add the second.
What changes in your first month
Week one: pick the user journey. Write the SLI (numerator/denominator spelled out), set the target at your 95th percentile, and wire the two burn-rate alerts.
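“Numerator/denominator spelled out” might look like this in code, using the login journey from earlier. The event definitions here are one reasonable choice, not the only one; field names are invented for the sketch:

```python
# An availability SLI spelled out as numerator / denominator.
# Denominator: valid login requests (client-side 4xx excluded).
# Numerator: valid requests that succeeded (2xx) in under 500 ms.
def login_availability_sli(requests):
    valid = [r for r in requests if not (400 <= r["status"] < 500)]
    good = [r for r in valid if r["status"] < 300 and r["latency_ms"] < 500]
    return len(good) / len(valid) if valid else 1.0
```

Writing the SLI this explicitly forces the arguments you want in week one, not during an incident: do 4xx responses count against you? Does a slow success count as good?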
Weeks two and three: let it run. Resist the urge to tune the target. Watch the burn rate. If it fires, the alert is doing its job; if it never fires during a week when users complained, the SLI is wrong, not the target.
Week four: hold a 30-minute review. Did the SLO track how users actually felt that month? If yes, add the second SLO. If not, rewrite the first one.