The Percentile Trap in Aggregated Metrics
p99 of p99 is not p99. The math that breaks aggregation, the cases where it matters, and the workarounds.
The math
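Quantiles do not commute with averaging. A quantile is a nonlinear function of the whole distribution, so the p99 of the combined traffic from 5 instances is generally not the average of their individual p99s, especially when the instances differ in load or latency.
The error can be 10-30% in real workloads, enough to mislead decisions.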
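A small worked example makes the gap concrete. The sketch below uses made-up latencies (four healthy instances and one degraded one, 100 requests each) and a nearest-rank percentile. The numbers are illustrative assumptions, not measurements.

```python
def p99(samples):
    """Nearest-rank p99: smallest value with at least 99% of samples at or below it."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds, 100 requests per instance.
fast = [10] * 99 + [50]          # a healthy instance
slow = [20] * 90 + [2000] * 10   # a degraded instance: 10% of requests take 2 s
instances = [fast, fast, fast, fast, slow]

per_instance = [p99(s) for s in instances]
averaged = sum(per_instance) / len(per_instance)   # the tempting shortcut
pooled = p99([x for s in instances for x in s])    # p99 over all 500 requests

print(per_instance)  # [10, 10, 10, 10, 2000]
print(averaged)      # 408.0
print(pooled)        # 2000
```

Here 2% of all requests take 2000 ms, so the true p99 is 2000 ms, while the average of per-instance p99s reports 408 ms. The direction and size of the error depend on how load and latency are distributed across instances.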
Histograms solve it
Use histogram metrics, not pre-aggregated quantiles. Aggregate the histograms; compute the quantile from the aggregate.
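PromQL supports this with histogram_quantile() over sum by (le) (rate(... _bucket)). The sum must keep the le label, because it carries the bucket boundaries that histogram_quantile() needs.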
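The idea of aggregating first and computing the quantile second can be sketched outside PromQL as well. The Python below is a minimal illustration, assuming cumulative bucket counts with identical bounds on every instance and linear interpolation inside the target bucket. The bounds, counts, and function names are hypothetical, and this is not a reimplementation of Prometheus internals.

```python
import math

# Cumulative counts per upper bound ("le"), in seconds. Values are made up.
BOUNDS = [0.1, 0.5, 1.0, math.inf]
instance_a = [90, 99, 100, 100]
instance_b = [10, 50, 100, 100]

def merge(*histograms):
    """Sum cumulative counts bucket by bucket; valid only when bounds match."""
    return [sum(counts) for counts in zip(*histograms)]

def quantile(q, bounds, cumulative):
    """Estimate a quantile by linear interpolation inside the target bucket."""
    total = cumulative[-1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, count in enumerate(cumulative):
        if count >= rank:
            break
    if math.isinf(bounds[i]):
        # Rank falls in the overflow bucket: report the highest finite bound.
        return bounds[i - 1]
    lower = bounds[i - 1] if i > 0 else 0.0
    below = cumulative[i - 1] if i > 0 else 0
    return lower + (bounds[i] - lower) * (rank - below) / (count - below)

merged = merge(instance_a, instance_b)
from_merged = quantile(0.99, BOUNDS, merged)
averaged = (quantile(0.99, BOUNDS, instance_a)
            + quantile(0.99, BOUNDS, instance_b)) / 2

print(merged)                 # [100, 149, 200, 200]
print(round(from_merged, 3))
print(round(averaged, 3))
```

With these counts the quantile from the merged histogram lands near the top of the 0.5 to 1.0 bucket, while the average of the two per-instance estimates sits well below it. The result is still an estimate: its resolution is limited by where the bucket boundaries fall.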
When it matters
It matters when you compare aggregate p99 across services, regions, or time windows. That is where the trap is most visible.
Single-instance p99 is fine. The trap is in aggregation, not in the per-instance quantile.