How to Set SLOs That Match What Your Users Actually Feel
99.9% on a backend service can mean a great quarter or a terrible one for users. The only SLO that matters is the one written from the user's chair.
The folklore problem
Most SLO targets are inherited folklore. "We do 99.9% because that's what people do." The number has no relationship to user experience and no relationship to engineering capacity. It exists because someone had to put a number on the page.
The honest test. Ask the on-call engineer: "What does it feel like when we miss this SLO?" If the answer is "no different from any other day," the SLO is fiction. The number must correspond to a felt change in user behaviour or business outcome; otherwise no one will defend it in a planning meeting.
Critical user journeys, not services
Per-service SLOs measure the wrong thing. Users do not perceive "the auth service was 99.95% successful." They perceive "I could not log in." A login attempt may touch six services; if any one fails, the user feels failure. The SLO needs to be on the journey, not the service.
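The CUJ framing. List the 5-10 things users do most often (sign in, complete checkout, post a comment, view dashboard, send a message). Each is a critical user journey. The SLO is "this CUJ succeeds at X%." The math compounds across every backing service: when each service on the path must succeed, journey success is at best the product of their individual success rates, so it is lower than any single service's number.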
The benefit. CUJ-level SLOs cannot be gamed by tuning one service while another quietly degrades. The number tracks reality.
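A rough way to see how backing services combine into one journey number. This is a sketch under two stated assumptions: failures are independent, and each service is called exactly once per journey. Retries, caching, and correlated outages will move the real figure, and the six 99.95% services are illustrative, not measured.

```python
from math import prod


def journey_success_rate(service_rates):
    """Estimate journey success when every service on the path must succeed.

    Assumes independent failures and one call per service per journey.
    """
    return prod(service_rates)


# Six hypothetical services on the login path, each at 99.95% success.
login_path = [0.9995] * 6
rate = journey_success_rate(login_path)
print(f"per-service: 99.95%  journey: {rate:.2%}")
```

Under these assumptions the journey lands at about 99.70%, already below a 99.9% target even though every individual service looks healthy.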
The four-step process
1. Pick the journey. Start with the highest-revenue or highest-volume CUJ. Do one well before doing five.
2. Define success in code. A success is "user finished the journey within N seconds with no error." A failure is anything else. Write the metric query that produces this.
3. Measure for 30 days, no target. Just observe. The number is your baseline. The 30-day mean and standard deviation of the daily success rate tell you what is achievable today.
4. Set a target slightly above the baseline. If today you do 99.4%, the SLO is 99.5%. The 0.1-point gap is the engineering ambition. Setting it to 99.99% with a 99.4% reality is theatre.
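Steps 2 to 4 can be sketched in a few lines. Everything named here is hypothetical: the `JourneyAttempt` record shape, the 5-second limit standing in for "N seconds", and the 0.001 step above baseline. In practice the success check would be a query against your own metrics store; the three daily rates stand in for 30 days of real data.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class JourneyAttempt:
    """One user's attempt at the journey (hypothetical record shape)."""
    duration_s: float  # time from first step to last step
    completed: bool    # user reached the final step
    had_error: bool    # any user-visible error along the way


def is_success(attempt, max_seconds=5.0):
    """Step 2: finished within the limit with no error; anything else fails."""
    return (attempt.completed
            and not attempt.had_error
            and attempt.duration_s <= max_seconds)


def success_rate(attempts, max_seconds=5.0):
    """Fraction of attempts that count as a success."""
    if not attempts:
        raise ValueError("no attempts recorded")
    good = sum(is_success(a, max_seconds) for a in attempts)
    return good / len(attempts)


def baseline(daily_rates):
    """Step 3: mean and sample standard deviation of the daily rates."""
    return mean(daily_rates), stdev(daily_rates)


def propose_target(baseline_rate, step=0.001):
    """Step 4: a target slightly above baseline, capped at 100%."""
    return min(baseline_rate + step, 1.0)


sample_day = [
    JourneyAttempt(1.2, True, False),   # success
    JourneyAttempt(7.5, True, False),   # too slow, counts as failure
    JourneyAttempt(0.8, False, True),   # errored out
    JourneyAttempt(2.0, True, False),   # success
]
print(f"sample day: {success_rate(sample_day):.0%}")

daily_rates = [0.993, 0.995, 0.994]  # stand-in for 30 daily values
avg, spread = baseline(daily_rates)
print(f"baseline {avg:.2%}, spread {spread:.2%}, "
      f"proposed target {propose_target(avg):.2%}")
```

Note that the slow attempt is a failure even though it returned no error: the definition follows what the user felt, not what the server logged.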
Setting the target without lore
The right number is the one that, when missed, the team and the business agree to act on. If missing 99.9% would not change behaviour, and missing 99.5% would not either, but missing 99.0% would, then 99.0% is your number. Below that line the business cares; above it nobody acts.
Test the number with a thought experiment. "If next month we miss it by 0.1 percentage points, what changes?" If the answer is "nothing," the number is wrong. The SLO should be the line below which engineering says "stop feature work, fix reliability." If no one would say that, lower the line.
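The thought experiment is easier to run in the room when the miss is stated in failed journeys rather than percentages. A minimal sketch, assuming a hypothetical volume of one million attempts per month and the 99.5% target from the earlier example:

```python
def failed_journeys(success_rate, attempts):
    """Number of failed journeys implied by a success rate at a given volume."""
    return round((1 - success_rate) * attempts)


monthly_attempts = 1_000_000  # hypothetical volume
target, actual = 0.995, 0.994

allowed = failed_journeys(target, monthly_attempts)
observed = failed_journeys(actual, monthly_attempts)
print(f"allowed {allowed}, observed {observed}, over by {observed - allowed}")
```

At this volume a 0.1-point miss is 1,000 extra users who could not finish the journey. If that figure would not stop feature work, the target is set too high.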
Antipatterns
Five nines for everything. Inherited from telecom; rarely reachable in modern web infrastructure. Picking 99.999% because it sounds impressive guarantees you will miss it routinely.
SLOs no one negotiates. The point of an SLO is the conversation it forces. If the SLO never comes up in planning, it is not doing its job.
Excusing ourselves from the SLO. "Errors due to dependent services do not count." Users do not care about your blame model. Count every failure the user saw, whichever service caused it.
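The effect of a blame filter is easy to quantify. The counts below are invented for illustration, and the cause labels (`own_error`, `dependency_error`) are hypothetical; the point is only how far the two numbers can drift apart.

```python
from collections import Counter

# Hypothetical outcomes for 10,000 checkout attempts, tagged by cause.
outcomes = Counter(ok=9930, own_error=20, dependency_error=50)
total = sum(outcomes.values())

# What users felt: every failure counts, regardless of cause.
user_felt = outcomes["ok"] / total

# Blame-filtered: dependency failures dropped from the denominator.
blame_filtered = outcomes["ok"] / (total - outcomes["dependency_error"])

print(f"user-felt {user_felt:.2%}, blame-filtered {blame_filtered:.2%}")
```

With these numbers the filtered figure reads about 99.80% while users experienced 99.30%. A 99.5% target would pass on the first and fail on the second.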
What to do this week
Three moves. (1) Pick the most revenue-critical CUJ in your product; write the success-criteria query. (2) Measure for 30 days; do not declare a target yet. (3) Schedule a meeting with engineering and product leadership for week 5; bring the baseline and propose a target the room would actually act on.