SLOs Against Aggregations vs Against Percentiles
The aggregation you choose changes the meaning of the SLO. Pick deliberately, not by default.
Why the aggregation matters
Average latency hides the long tail. p99 surfaces it.
Two services with the same average can deliver very different user experiences.
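A minimal sketch of that point, using made-up latencies rather than measurements from any real service. The `percentile` helper is an assumption of this example (a nearest-rank definition); monitoring systems often interpolate or estimate from histograms instead.

```python
import math
from fractions import Fraction
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # Fraction(str(p)) keeps values such as 99.9 exact, avoiding float rounding in the rank.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds, 100 requests per service.
steady = [100] * 100             # every request takes 100 ms
spiky = [80] * 98 + [1080] * 2   # 98 fast requests, 2 very slow ones

for name, latencies in (("steady", steady), ("spiky", spiky)):
    print(name, "average:", mean(latencies), "p99:", percentile(latencies, 99))
```

Both services average 100 ms, but the p99 is 100 ms for the first and 1080 ms for the second.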
Four common shapes
- Average: easy; misleading for tail.
- p50/median: better; still hides the slower 50% of requests.
- p99: catches the tail; users in the tail are real.
- p99.9: bleeding edge; one request in 1,000.
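To see the four shapes side by side, the sketch below summarizes one hypothetical window of 1,000 requests. The data, the `summarize` name, and the nearest-rank `percentile` helper are all assumptions for illustration, not a prescribed method.

```python
import math
from fractions import Fraction
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(latencies_ms):
    """Compute the four aggregation shapes for one window of latencies."""
    return {
        "average": mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
        "p99.9": percentile(latencies_ms, 99.9),
    }

# Hypothetical window: mostly fast requests, with a thin and increasingly slow tail.
latencies_ms = [50] * 900 + [200] * 85 + [1000] * 13 + [5000] * 2
print(summarize(latencies_ms))
```

For this sample the average is 85 ms and the p50 is 50 ms, while the p99 is 1000 ms and the p99.9 is 5000 ms. The same window looks healthy or broken depending on which shape the SLO uses.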
Matching to user perception
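Users judge ‘slow’ by the worst requests in their own stream, roughly the p95-p99 of what they personally see. Because one session issues many requests, the p99 of YOUR aggregate is often something a typical (p50) user hits at least once.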
For most consumer SaaS, p99 is the right SLO base.
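The arithmetic behind this can be sketched as follows, assuming each request independently has a 1% chance of landing beyond the aggregate p99. Real traffic is correlated, so treat the numbers as a rough guide; the function name and the session sizes are hypothetical.

```python
def chance_of_hitting_tail(requests_per_session, percentile=99.0):
    """Probability that at least one request in a session is slower than the
    aggregate percentile, assuming requests are independent (a simplification)."""
    p_within = percentile / 100
    return 1 - p_within ** requests_per_session

# Hypothetical session sizes, in requests per session.
for n in (1, 10, 69, 100, 300):
    print(n, round(chance_of_hitting_tail(n), 3))
```

Under this assumption, at about 69 requests per session roughly half of all users see at least one request slower than the aggregate p99, and at 100 requests it is roughly 63%.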
Combining shapes
Some teams set separate SLOs at p50 and p99. That catches both regression types: a broad slowdown that shifts the median, and a degradation confined to the tail.
Cost: 2x SLO maintenance. Worth it for high-stakes services.
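One way a paired check might look, as a sketch only. The targets in `SLO_TARGETS_MS`, the `diagnose` labels, and the sample data are hypothetical, and the nearest-rank `percentile` helper is an assumption of this example.

```python
import math
from fractions import Fraction

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latency targets in milliseconds, keyed by percentile.
SLO_TARGETS_MS = {50: 120, 99: 800}

def diagnose(latencies_ms):
    """Classify a window by which of the two SLOs it violates."""
    p50_ok = percentile(latencies_ms, 50) <= SLO_TARGETS_MS[50]
    p99_ok = percentile(latencies_ms, 99) <= SLO_TARGETS_MS[99]
    if p50_ok and p99_ok:
        return "healthy"
    if not p50_ok:
        return "broad regression"   # the typical request got slower
    return "tail regression"        # the median is fine, the slowest requests are not

print(diagnose([100] * 100))                # healthy
print(diagnose([150] * 100))                # broad regression
print(diagnose([100] * 97 + [2000] * 3))    # tail regression
```

A p99-only SLO would pass the second window and a p50-only SLO would pass the third, which is the case for carrying both.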
Antipatterns
- Average-only SLO. Misses tail outages.
- p99.9 SLO without budget for the tail. The error budget burns constantly.
- Same shape for every service. Mismatch with user model.
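The "always burning" antipattern can be illustrated with simple budget arithmetic. This sketch assumes the SLO is phrased as "at least X% of requests finish under a latency threshold", so the budget is the remaining share of requests allowed to be slow. The function name and the traffic numbers are hypothetical.

```python
from fractions import Fraction

def budget_consumed(slow_requests, total_requests, target_percentile):
    """Share of the error budget used: observed slow fraction / allowed slow fraction."""
    # Fraction(str(...)) keeps 99.9 exact, so the allowed fraction is exactly 1/1000.
    allowed = 1 - Fraction(str(target_percentile)) / 100
    observed = Fraction(slow_requests, total_requests)
    return float(observed / allowed)

# Hypothetical month: 1,000,000 requests, 2,000 of them (0.2%) over the threshold.
print(budget_consumed(2_000, 1_000_000, 99.9))  # 2.0, i.e. 200% of the budget
print(budget_consumed(2_000, 1_000_000, 99))    # 0.2, i.e. 20% of the budget
```

The same baseline tail that fits comfortably inside a p99 target overspends a p99.9 target twice over, before any incident happens.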
What to do this week
Three moves. (1) Apply the pattern to your most-impactful service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap is durable.