By Samson Tanimawo, PhD · Published Dec 20, 2026

p99 Latency Diagnosis: A Field-Tested Workflow

When the median is fine but the long tail is killing users, the first instinct is wrong about half the time. Here is the workflow that lands the diagnosis in under an hour.

Step 1: confirm it is actually p99

"Latency is up" rarely means what people think it means. Pull the percentile graph for the last 24 hours and look at p50, p90, p99, p99.9 side by side. If p50 also rose, this is a system-wide issue, not a tail. If only p99 rose, the tail is real.

The ratio matters too. A p99/p50 ratio of 5x is normal for many systems; a ratio of 50x is a tail problem, where the slowest requests are wildly off the typical case. Knowing this number frames everything that follows.
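If you can export raw latencies, the check is a few lines. A minimal sketch, assuming a flat file of per-request latencies in milliseconds (the filename is a placeholder):

```python
import numpy as np

# Placeholder export: one latency per line, in milliseconds.
latencies_ms = np.loadtxt("latencies_ms.txt")

p50, p90, p99, p999 = np.percentile(latencies_ms, [50, 90, 99, 99.9])
print(f"p50={p50:.1f}ms  p90={p90:.1f}ms  p99={p99:.1f}ms  p99.9={p999:.1f}ms")

# Roughly 5x is unremarkable for many systems; roughly 50x is a real tail problem.
print(f"p99/p50 = {p99 / p50:.1f}x")
```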

Step 2: look at the histogram, not the percentile

Percentiles compress information. A latency histogram (a Prometheus _bucket series, a Datadog distribution, an OpenTelemetry histogram) shows the actual shape: bimodal, fat tail, single hump. The shape tells you what kind of problem you have.

Bimodal: two clusters, one fast, one slow. Some path is taking the slow road; find which.

Fat tail: most requests fast, a small number very slow. Often a database tail, a GC pause, or a cold cache.

Slow drift: the whole distribution shifts right over hours. Resource pressure, slow leak, or growing dataset.
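If the metrics UI does not make the shape obvious, exported samples do. A sketch that bins the same placeholder export as above into log-spaced buckets, which spread the tail out so bimodality and fat tails are easy to spot by eye:

```python
import numpy as np

latencies_ms = np.loadtxt("latencies_ms.txt")  # placeholder export, as above

# Log-spaced buckets spread the tail out instead of cramming it into one bin.
edges = np.logspace(np.log10(max(latencies_ms.min(), 0.1)),
                    np.log10(latencies_ms.max()), num=25)
counts, _ = np.histogram(latencies_ms, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    bar = "#" * int(60 * n / counts.max())
    print(f"{lo:9.1f}-{hi:9.1f} ms | {bar}")
```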

Step 3: segment by everything obvious

Slice p99 by endpoint, by region, by tenant, by client version, by deploy version. The slow requests usually concentrate in one slice. If they spread evenly across all slices, the problem is shared infrastructure (database, cache, network). If they concentrate, the problem is local.
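A sketch of the slicing, assuming a per-request export with the latency and dimension columns named below (the filename and column names are assumptions about your setup):

```python
import pandas as pd

# Hypothetical export: one row per request with its latency and dimensions
# (latency_ms, endpoint, region, tenant, client_version, deploy_version).
df = pd.read_csv("requests.csv")

for dim in ["endpoint", "region", "tenant", "client_version", "deploy_version"]:
    worst = df.groupby(dim)["latency_ms"].quantile(0.99).sort_values(ascending=False)
    print(f"\np99 by {dim}:")
    print(worst.head())
```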

The "tenant skew" pattern. One tenant whose data is 50x bigger pulls the tail of any per-tenant query. A separate slice or a partial index for that tenant often solves it.

Step 4: the three usual suspects

Lock contention. When the database serialises queries on a row or a table, p99 explodes while p50 stays fine. Look at lock-wait metrics; in Postgres, pg_stat_activity.wait_event_type tells you.
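A sketch of that check from Python with psycopg2; the connection string is a placeholder, and the columns queried are standard pg_stat_activity columns:

```python
import psycopg2

# Placeholder DSN; point it at the database under suspicion.
conn = psycopg2.connect("dbname=app user=readonly host=db.internal")

with conn.cursor() as cur:
    # Sessions currently waiting on a lock, oldest first.
    cur.execute("""
        SELECT pid, wait_event_type, wait_event, state,
               now() - query_start AS blocked_for,
               left(query, 80) AS query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock'
        ORDER BY query_start
    """)
    for row in cur.fetchall():
        print(row)
```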

GC or runtime pauses. JVM, .NET, Go all have stop-the-world events. A p99 spike that correlates exactly with GC duration metrics is the smoking gun.
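Checking that correlation takes minutes if you can export both series. A sketch, assuming aligned per-minute exports of p99 and GC pause time (the filenames and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute exports, indexed by timestamp.
p99 = pd.read_csv("p99_by_minute.csv", index_col="ts")["p99_ms"]
gc = pd.read_csv("gc_pause_by_minute.csv", index_col="ts")["pause_ms"]

# Align on shared timestamps, then correlate the two series.
joined = pd.concat([p99, gc], axis=1, join="inner")
r = np.corrcoef(joined["p99_ms"], joined["pause_ms"])[0, 1]
print(f"correlation between p99 and GC pause time: {r:.2f}")
```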

Cold caches. A cache miss that goes to a slow backend creates a fat tail. The miss rate is fine on average, but each miss is 100x slower than a hit. Often hides behind "average is fine."
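The arithmetic is worth doing once, with illustrative numbers: a 1 ms hit, a 100 ms backend fetch on miss, a 2% miss rate. The mean barely moves while the p99 is entirely the miss path:

```python
hit_ms, miss_ms, miss_rate = 1.0, 100.0, 0.02  # illustrative numbers

mean = (1 - miss_rate) * hit_ms + miss_rate * miss_ms
print(f"mean ~ {mean:.1f} ms")   # ~3 ms: looks healthy on a dashboard

# With a 2% miss rate, everything above the 98th percentile is a miss,
# so the p99 is the backend latency, not the cache latency.
print(f"p99  = {miss_ms:.1f} ms")
```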

Step 5: trace one slow request end to end

Distributed tracing pays for itself in moments like this. Pull a single trace from the slow bucket and look at the spans. The slow span is usually obvious, and it is usually not the span the engineer guessed.

If you do not have tracing, the workflow is "pull logs for one slow request, manually reconstruct the path, time each step." This works but takes 10x longer. If your team is doing this often, install tracing this quarter.
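The manual version is mechanical once the log format is known. A sketch, assuming each line carries an ISO timestamp, a request id, and a step name (the format and step names here are hypothetical):

```python
from datetime import datetime

# Hypothetical log lines for one slow request: "<ISO timestamp> <request_id> <step>"
lines = [
    "2026-12-20T10:00:00.012Z req-42 gateway_received",
    "2026-12-20T10:00:00.019Z req-42 auth_done",
    "2026-12-20T10:00:01.870Z req-42 db_query_done",
    "2026-12-20T10:00:01.884Z req-42 response_sent",
]

events = []
for line in lines:
    ts, _rid, step = line.split(" ", 2)
    events.append((datetime.fromisoformat(ts.replace("Z", "+00:00")), step))

# Time each step; the slow one (here the db query) jumps out.
for (t0, _), (t1, step) in zip(events, events[1:]):
    print(f"{step:<20} {(t1 - t0).total_seconds() * 1000:8.1f} ms")
```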

Step 6: validate the fix moved the histogram

Deploy the fix; do not declare victory until you see the histogram change shape. The percentile number moving is good; the histogram normalising is the proof. Many "fixes" reduce p99 by 10% but leave the underlying tail intact, and the next traffic spike re-exposes it.
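Comparing before and after is the same export twice. A sketch, assuming raw latency dumps from comparable traffic windows before and after the deploy (the filenames are placeholders); the KS test is an optional extra check that the shape moved, not just one percentile:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder exports of raw latencies from comparable windows.
before = np.loadtxt("latencies_before_ms.txt")
after = np.loadtxt("latencies_after_ms.txt")

for q in (50, 90, 99, 99.9):
    b, a = np.percentile(before, q), np.percentile(after, q)
    print(f"p{q:<5} {b:8.1f} ms -> {a:8.1f} ms")

# A two-sample KS test asks whether the distributions genuinely differ.
stat, p_value = ks_2samp(before, after)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")
```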

Antipatterns

Optimising the average. The average is the wrong metric for user-perceived performance. Stop reporting it.

Adding more replicas to a tail problem. If the tail is per-request (a lock, a GC pause, a cold cache), adding replicas does not help; each replica still has the tail.

Sampling out the slow requests. Some tracing setups drop high-latency requests from the sample because they are "outliers." That is exactly the wrong direction: slow requests are the ones you should preferentially keep.
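The decision logic is small. A generic sketch of tail-aware sampling, not any particular tracer's API; the threshold and base rate are illustrative, and in production the same policy usually lives in your collector's tail-sampling configuration rather than application code:

```python
import random

SLOW_THRESHOLD_MS = 500   # illustrative; set from your latency SLO
BASE_SAMPLE_RATE = 0.01   # illustrative 1% for ordinary requests

def keep_trace(duration_ms: float) -> bool:
    """Always keep slow traces; sample the rest at a low base rate."""
    if duration_ms >= SLOW_THRESHOLD_MS:
        return True
    return random.random() < BASE_SAMPLE_RATE
```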

What to do this week

Three moves. (1) On your top-3 endpoints, replace any avg-latency dashboard with a histogram visualisation. (2) Add tail-aware sampling to your tracing config so slow requests are over-sampled. (3) Identify the endpoint with the worst p99/p50 ratio and walk through this six-step workflow on it.