p99 and Tail Latency: The Number You Cannot Ignore

Average latency is comforting and wrong. p99 is uncomfortable and right.

Why average lies

Average latency hides the worst experiences. Two services with the same mean can deliver wildly different user experiences depending on the shape of the tail.

Same mean, different p99. A 200ms average over 1000 requests can have a p99 of 250ms or 5 seconds; you cannot tell from the mean.
Tail users are real. The 1% in the tail are real users, and for them the tail latency is the product.
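Compound tail. Microservices fan out, and every extra call is another chance to land in some service's tail. If a request touches 10 services and each independently exceeds its own p99 1% of the time, about 10% of requests hit at least one slow call; at 100 services it is about 63%. The user-visible p99 is much worse than any single service's p99.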
Mean conceals trend. A regression confined to the tail may move the mean by only a few milliseconds, so a mean-based alert never fires.
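
The first and third points can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two latency lists are made-up data, the percentile uses the nearest-rank definition (other definitions give slightly different values), and the fan-out estimate assumes each call lands in its tail independently, which real systems only approximate.

```python
import math

def mean(samples):
    return sum(samples) / len(samples)

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def chance_of_hitting_tail(fan_out, quantile=0.99):
    """Probability that at least one of `fan_out` independent calls is
    slower than its own `quantile` latency."""
    return 1 - quantile ** fan_out

# Hypothetical latencies in milliseconds, 1000 requests per service.
tight = [150] * 500 + [250] * 500
long_tail = [100] * 480 + [104] * 500 + [5000] * 20

print(mean(tight), mean(long_tail))                      # same mean
print(percentile(tight, 99), percentile(long_tail, 99))  # very different p99

for n in (1, 10, 100):
    print(n, round(chance_of_hitting_tail(n), 3))
```

Both lists average 200 ms, yet the nearest-rank p99 is 250 ms for the first and 5000 ms for the second. The loop shows how quickly the chance of touching at least one tail grows with fan-out.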

Four causes of tail growth

1. Lock contention.
2. Garbage collection.
3. Cold caches.
4. Resource saturation in part of the fleet.

Per-cause mitigation

Each cause has a different remediation. Identify which one is dominant before reaching for a fix; mitigations rarely transfer across causes.

Lock contention. Move to lock-free data structures, partition the work, or shard the resource being contended.
GC pauses. Tune the collector, shrink the heap, or move to a low-pause collector (ZGC or Shenandoah on the JVM).
Cold caches. Pre-warm on deploy or restart, use per-region caches, raise TTL where staleness is acceptable.
Resource saturation. Rebalance load, add capacity to the saturated subset, or steer traffic away from hot pods.
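
As one illustration of the "shard the resource" option for lock contention, here is a minimal sketch of a counter split across several independently locked shards. The class name, the shard count, and the tenant keys are all hypothetical, and in CPython the GIL limits how much parallelism this buys; the point is the structure, in which two writers only wait on each other when their keys map to the same shard.

```python
import threading

class ShardedCounter:
    """Per-key counter striped across shards, one lock per shard."""

    def __init__(self, shards=16):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [{} for _ in range(shards)]

    def _shard(self, key):
        # Keys are spread across shards by hash.
        return hash(key) % len(self._locks)

    def increment(self, key, amount=1):
        i = self._shard(key)
        with self._locks[i]:  # only this shard is blocked
            self._counts[i][key] = self._counts[i].get(key, 0) + amount

    def get(self, key):
        i = self._shard(key)
        with self._locks[i]:
            return self._counts[i].get(key, 0)

    def total(self):
        # Takes each shard lock in turn; not an atomic snapshot.
        result = 0
        for lock, counts in zip(self._locks, self._counts):
            with lock:
                result += sum(counts.values())
        return result

counter = ShardedCounter(shards=4)

def worker():
    for i in range(1000):
        counter.increment(f"tenant-{i % 10}")

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.get("tenant-0"), counter.total())
```

The trade-off is that whole-structure reads such as total() must visit every shard, so this shape suits write-heavy, read-light workloads.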

Tail-aware monitoring

You cannot fix what you cannot see. Tail-aware monitoring records distributions, not summaries, and alerts on percentiles directly.

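Histograms. Prometheus histogram _bucket series record cumulative counts per latency bucket; estimate p50, p99, and p99.9 from the same data with histogram_quantile. Accuracy depends on where the bucket boundaries sit, so place boundaries near your SLO thresholds.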
Per-percentile alerts. Alert on p99 and p99.9 separately; mean-based alerts miss the regressions that matter.
Per-tenant percentile. Tail latency often concentrates on one or two tenants; aggregate-only metrics hide it.
SLO targets. SLOs measured at p99 force the team to design for the tail; mean-based SLOs reward the wrong thing.
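
To make the histogram point concrete, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is a simplified approximation in the spirit of PromQL's histogram_quantile, not a reimplementation of it; the function name and the bucket data are assumptions made for illustration.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted
    by upper_bound, ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The open-ended bucket has no upper edge to interpolate
                # toward, so fall back to the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical request-duration histogram, bounds in seconds.
buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 999), (math.inf, 1000)]

for q in (0.5, 0.99, 0.999):
    print(q, round(estimate_quantile(q, buckets), 4))
```

With these buckets the p99 estimate lands between 0.25 s and 0.5 s because that bucket contains the 990th request. The estimate can never be more precise than the bucket width, which is why boundary placement matters.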

Antipatterns

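Average-only monitoring. The mean barely moves when the tail regresses, so the dashboard stays green while users wait.
p99 ignored as ‘outlier.’ At production volume, 1% of requests is a lot of real, user-impacting traffic.
Optimizing average. Speeding up the common case improves the mean and leaves the tail untouched; it is the wrong target.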

What to do this week

Three moves. (1) Pick your slowest production endpoint, identify the dominant cause of its tail, and apply the matching mitigation. (2) Measure p99 before and after the change. (3) Document the win and ship the runbook so the team can reproduce it.
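
For step (2), production numbers should come from your existing latency histograms. For a quick local check before and after a change, something like the following harness may help. It is a sketch under stated assumptions: p99_ms and compare are hypothetical helper names, the percentile is nearest-rank, and a single-process loop will not reproduce production contention.

```python
import math
import time

def p99_ms(fn, runs=1000, clock=time.perf_counter):
    """Call `fn` `runs` times; return the nearest-rank p99 latency in ms.

    `clock` is injectable so the harness can be tested deterministically.
    """
    durations = []
    for _ in range(runs):
        start = clock()
        fn()
        durations.append((clock() - start) * 1000)
    durations.sort()
    return durations[math.ceil(99 * runs / 100) - 1]

def compare(before_fn, after_fn, runs=1000):
    """Measure both variants and report the relative p99 change."""
    before = p99_ms(before_fn, runs)
    after = p99_ms(after_fn, runs)
    change = (before - after) / before * 100 if before else 0.0
    return {"before_ms": before, "after_ms": after, "improvement_pct": change}

# Stand-in workloads; replace with calls to the old and new code paths.
def old_path():
    sum(range(20_000))

def new_path():
    sum(range(2_000))

report = compare(old_path, new_path, runs=200)
print(report)
```

Run it several times and on a quiet machine; a p99 taken from a few hundred samples is noisy, so treat small differences with suspicion.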