SLOs Against Aggregations vs Against Percentiles

The aggregation you choose changes the meaning of the SLO. Pick deliberately, not by default.

Why the aggregation matters

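Average latency hides the long tail; p99 surfaces it. Two services with the same average can deliver radically different user experiences: one is uniformly fast, the other is fast for most requests and very slow for a few. The aggregation you choose determines which of those differences the SLO can see.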
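A minimal sketch of that effect, using made-up latency samples. The `percentile` helper, the service names, and the numbers are illustrative assumptions, not measurements from any real system.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, 100 requests per service.
steady = [100] * 100             # every request takes 100 ms
spiky = [80] * 98 + [1080] * 2   # most requests are fast, two are very slow

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(
        name,
        "mean:", statistics.mean(samples),
        "p50:", percentile(samples, 50),
        "p99:", percentile(samples, 99),
    )
```

Both services average 100 ms, so an SLO on the mean treats them as identical. An SLO on p99 separates them: 100 ms for the steady service against 1080 ms for the spiky one.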

Four common shapes

Four aggregation shapes cover most SLO definitions. Each has a sweet spot and a failure mode; knowing both prevents picking the wrong one by default.

Matching to user perception

User perception of "slow" happens at p95-p99 of their own request stream. Counterintuitively, the p99 of your aggregate is often part of a typical user's experience rather than a rare event: because users make many requests, most of them hit at least one request in the aggregate tail. Pick the SLO base to model user experience, not infrastructure tidiness.
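The arithmetic behind this can be sketched as follows, under the simplifying assumption that each request independently has a 1% chance of landing above the aggregate p99. Real traffic is usually correlated, so treat the numbers as rough; the function names are hypothetical.

```python
import math


def chance_of_tail_hit(requests, quantile=0.99):
    """Probability that at least one of `requests` exceeds the given quantile.

    Assumes requests are independent, which is a simplification.
    """
    return 1.0 - quantile ** requests


def requests_for_even_odds(quantile=0.99):
    """Smallest request count at which a tail hit is at least 50% likely."""
    return math.ceil(math.log(0.5) / math.log(quantile))


for n in (1, 10, 100):
    print(n, "requests:", round(chance_of_tail_hit(n), 3))
print("even odds at", requests_for_even_odds(), "requests")
```

Under that assumption, a user who makes 100 requests has roughly a 63% chance of seeing at least one request slower than the aggregate p99, and the odds pass 50% at 69 requests.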

Combining shapes

Some teams ship separate SLOs at p50 and p99 simultaneously. The pair catches both classes of regression: median drift (everyone gets slower) and tail expansion (some users get much slower). The cost is 2x the SLO maintenance, which is worth it for high-stakes services.
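A minimal sketch of a paired check, showing how each SLO catches a regression the other misses. The thresholds (120 ms at p50, 500 ms at p99), the sample data, and the helper names are assumptions chosen for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


# Hypothetical targets: percentile -> maximum allowed latency in ms.
TARGETS = {50: 120, 99: 500}


def breached(samples, targets=TARGETS):
    """Return the names of the SLOs that the samples violate."""
    return [
        f"p{pct}"
        for pct, limit in sorted(targets.items())
        if percentile(samples, pct) > limit
    ]


# Hypothetical latency windows, 100 requests each.
baseline = [100] * 98 + [400] * 2        # healthy
median_drift = [150] * 98 + [400] * 2    # everyone slower, tail unchanged
tail_expansion = [100] * 98 + [900] * 2  # median unchanged, tail much slower

for name, window in (
    ("baseline", baseline),
    ("median_drift", median_drift),
    ("tail_expansion", tail_expansion),
):
    print(name, breached(window))
```

With these numbers, median drift breaches only the p50 SLO and tail expansion breaches only the p99 SLO, so a team running just one of the two would miss one class of regression entirely.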

Antipatterns

What to do this week

Three moves. (1) Choose the aggregation deliberately for the SLO on your most-impactful service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap is durable.