The Percentile Trap in Aggregated Metrics

p99 of p99 is not p99. The math that breaks aggregation, the cases where it matters, and the workarounds.

The math

The percentile trap is a common mistake in aggregated metrics. Computing a percentile on each instance and then averaging those percentiles produces a number that looks plausible, is usually wrong, and misleads the decisions built on it. Avoiding the trap means recognizing where it occurs and fixing how the aggregation is done.

What the math says:

Quantiles do not commute with averaging: Computing p99 on each instance and then averaging the results is a different operation from computing p99 over all instances' requests together. In general the two produce different numbers.
The p99 across 5 instances is NOT the average of their individual p99s: The aggregate p99 is taken over every data point from every instance, so it depends on how much traffic each instance serves and how heavy each tail is. If four instances have a p99 of 100ms and one has a p99 of 500ms, the average is 180ms, but the true aggregate p99 can land anywhere from 100ms to 500ms. One instance's p99 might sit at the aggregate's p95 or p99.5.
The error can be 10 to 30% in real workloads: A gap of that size means the team's picture of its service's tail latency is off by a meaningful amount.
Enough to mislead decisions: SLO compliance, capacity planning, and optimization priorities can all be wrong when the percentile behind them is wrong.
The trap is invisible without specific testing: The wrong number looks plausible, so nothing prompts a second look. Teams sometimes run dashboards with the trap for years before noticing.

The math is the foundation. Understanding why averaging percentiles is wrong is the prerequisite for fixing it.
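A small numeric sketch can make the gap concrete. The latencies below are hypothetical and deliberately exaggerated: four healthy instances and one degraded instance, each serving 1000 requests. The `percentile` helper uses the nearest-rank definition, which is an assumption made for this illustration; other percentile definitions shift the numbers slightly but not the conclusion.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer such as 99."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct/100 * n) in integer math
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds, 1000 requests per instance.
healthy = [20] * 980 + [100] * 20    # 2% of requests take 100 ms
degraded = [20] * 900 + [500] * 100  # 10% of requests take 500 ms
instances = [healthy, healthy, healthy, healthy, degraded]

# The trap: a percentile per instance, then an average of those percentiles.
per_instance_p99 = [percentile(inst, 99) for inst in instances]
averaged_p99 = sum(per_instance_p99) / len(per_instance_p99)

# The correct aggregate: pool every request, then take the percentile once.
pooled = [latency for inst in instances for latency in inst]
pooled_p99 = percentile(pooled, 99)

print(per_instance_p99)  # [100, 100, 100, 100, 500]
print(averaged_p99)      # 180.0
print(pooled_p99)        # 500
```

The averaged figure of 180ms suggests a comfortable tail, while 2% of all requests in the pooled data actually take 500ms.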

Histograms solve it

The fix is to use histograms instead of pre-aggregated percentiles. Histograms aggregate correctly; the percentile is computed from the aggregated histogram.

Use histogram metrics: Instead of pre-computing percentiles on each instance, emit histograms. A histogram is a set of bucket counts, and counts can be added together without losing meaning.
Not pre-aggregated quantiles: A quantile computed on the instance discards the distribution behind it, so it cannot be recombined correctly afterward. A histogram preserves an approximation of that distribution.
Aggregate the histograms: Sum the bucket counts across instances, bucket by bucket. This requires the same bucket boundaries on every instance. The summed histogram represents the combined distribution.
Compute the quantile from the aggregate: The percentile is estimated from the summed histogram, so it reflects the combined traffic rather than an average of per-instance numbers. Its precision is limited by the bucket boundaries, so choose buckets that are fine-grained around the latencies you care about.
PromQL supports this: histogram_quantile(0.99, sum(rate(metric_bucket[5m])) by (le)) is the standard pattern. The sum aggregates bucket rates across instances while keeping the le (bucket upper bound) label; histogram_quantile then estimates the percentile from the aggregated buckets.

Histograms are the right tool for percentile aggregation. The pattern is well-known; the discipline is using it.
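The aggregation steps above can be sketched in a few lines. This is a simplified illustration, not Prometheus's implementation: the function names and bucket boundaries are made up for the example, counts are cumulative (each bucket counts observations at or below its upper bound), and the quantile is estimated by linear interpolation inside the bucket that contains it. The data is hypothetical: four healthy instances and one degraded instance, 1000 requests each.

```python
INF = float("inf")

def merge_histograms(histograms):
    """Sum cumulative bucket counts across instances (same bounds assumed)."""
    merged = {}
    for hist in histograms:
        for upper_bound, count in hist.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return merged

def quantile_from_histogram(q, cumulative):
    """Estimate a quantile by linear interpolation within the matching bucket."""
    bounds = sorted(cumulative)
    target = q * cumulative[bounds[-1]]  # rank of the requested quantile
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = cumulative[bound]
        if count >= target:
            if bound == INF or count == prev_count:
                # Cannot interpolate into an open-ended or empty bucket.
                return prev_bound
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts; keys are bucket upper bounds in ms.
healthy = {25: 980, 50: 980, 100: 1000, 250: 1000, 500: 1000, INF: 1000}
degraded = {25: 900, 50: 900, 100: 900, 250: 900, 500: 1000, INF: 1000}

merged = merge_histograms([healthy, healthy, healthy, healthy, degraded])
estimate = quantile_from_histogram(0.99, merged)
print(merged[500])  # 5000
print(estimate)     # 375.0
```

The estimate of 375ms falls inside the 250 to 500ms bucket, which is where the aggregate p99 really lies; how close it gets to the exact value depends on how wide that bucket is.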

When it matters

The trap matters when aggregate percentiles drive decisions. Single-instance percentiles are correct; the trap appears in aggregation across instances, regions, or time periods.

When you compare aggregate p99 across services, regions, or time: Cross-service comparison, multi-region rollups, and time-window comparison all combine percentiles from several sources. The trap can appear in each.
The trap is most visible there: Decisions based on averaged percentiles can be 10 to 30% off, so the team ends up choosing on the basis of wrong data.
Single-instance p99 is fine: A single instance's percentile is computed from its own data and represents that instance accurately.
The trap is in aggregation, not in the per-instance quantile: Each per-instance number is correct; combining several of them by averaging is what goes wrong. Knowing where the trap lives prevents over-correction.
Audit existing dashboards: Many teams have dashboards with the trap embedded. An audit identifies the affected panels; remediation replaces averaged percentiles with histogram-based aggregates.
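A first pass at such an audit can be automated with a rough text heuristic. The sketch below assumes the dashboard queries are PromQL and have already been exported as strings; the panel names and metric names are hypothetical. It flags any query where an averaging function wraps either a summary's quantile label or a histogram_quantile call. It is a heuristic only: it will miss averaging hidden inside recording rules and may flag queries that are fine, so every hit needs a human look.

```python
import re

# Heuristic: avg(...) or avg_over_time(...) wrapped around a `quantile`
# label matcher or a histogram_quantile() call.
SUSPECT = re.compile(
    r"\bavg(?:_over_time)?\b.*(?:quantile\s*=~?|histogram_quantile\s*\()",
    re.DOTALL,
)

def find_suspect_queries(queries):
    """Return the names of panels whose query matches the heuristic."""
    return [name for name, expr in queries.items() if SUSPECT.search(expr)]

# Hypothetical panel name -> PromQL expression.
queries = {
    "checkout p99 (averaged summary)":
        'avg(http_request_duration_seconds{quantile="0.99"})',
    "checkout p99 (histogram)":
        "histogram_quantile(0.99, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
    "checkout p99 (avg of per-pod p99)":
        "avg(histogram_quantile(0.99, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, pod)))",
}

suspects = find_suspect_queries(queries)
for name in suspects:
    print("review:", name)
```

The first and third panels are flagged; the second, which sums buckets before taking the quantile, is not.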

The percentile trap in aggregates is one of those metric mistakes that everyone makes once and learns from. Nova AI Ops integrates with metric stores, surfaces percentile aggregation patterns, and helps teams identify and fix the trap before it produces wrong decisions.