Product Updates · By Samson Tanimawo, PhD · Published Sep 5, 2026

Correlation Engine: 38% p95 Speedup

We cut peak-load p95 on the incident-correlation engine by 38%. Here's the graph rewrite, the indexing change, and the load-shedding behaviour we built for storms.

What was slow

The correlation engine takes alerts, traces, and metric signals as input and produces incident clusters as output: "these 14 signals are the same thing" or "this is a new incident, not a continuation of the existing one." Under steady-state load it ran at 200ms p95. Under storm conditions (a noisy-neighbour service spraying 3,000 alerts in 60 seconds), p95 climbed to 4-6 seconds. That's not catastrophic on its own; the catastrophic part was that clustering decisions arrived late, so on-call engineers got a flood of separate-looking alerts that should have been one cluster.

We profiled. The hot path was a graph traversal over service dependencies: for each new signal, we walked outward from the affected service to find related signals from connected services. The traversal was n-hop bounded but unindexed, so for services with high fan-out (a database, a shared queue), each new signal triggered a re-traversal of the same neighbourhood from a different entry point.
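As a rough illustration (not our production code), the old read path behaved like this bounded breadth-first walk; `adjacency` and `related_services` are hypothetical names:

```python
from collections import deque

def related_services(adjacency, start, max_hops=3):
    """Walk the dependency graph outward from the affected service,
    bounded to max_hops. adjacency maps service ID -> set of neighbours."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        service, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for neighbour in adjacency.get(service, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return seen

# Every signal against a high fan-out service (a shared database, a queue)
# re-runs this walk over the same large neighbourhood from a new entry point.
```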

The graph rewrite

The fix: replace the on-the-fly traversal with a materialised neighbourhood index. For every service, we precompute its 2-hop and 3-hop neighbourhoods at write time (when the service registry or topology changes) instead of read time (when correlating). The neighbourhood is a fixed-size set of service IDs; lookup is O(1).
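Here is a minimal sketch of the idea, assuming an in-memory adjacency map; the class and method names are hypothetical:

```python
class NeighbourhoodIndex:
    """Precompute 2-hop and 3-hop neighbourhoods per service at topology-write
    time so the correlation read path is a single dictionary lookup."""

    def __init__(self, adjacency):
        self.adjacency = adjacency              # service ID -> set of neighbours
        self.neighbourhoods = {}                # (service, hops) -> frozenset
        for service in adjacency:
            for hops in (2, 3):
                self.neighbourhoods[(service, hops)] = self._walk(service, hops)

    def _walk(self, start, max_hops):
        seen = {start}
        frontier = {start}
        for _ in range(max_hops):
            frontier = {n for s in frontier for n in self.adjacency.get(s, ())} - seen
            seen |= frontier
        return frozenset(seen)

    def lookup(self, service, hops=3):
        # O(1) at read time: the traversal already happened at write time.
        return self.neighbourhoods.get((service, hops), frozenset())
```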

The trade-off: writes get more expensive when the topology changes. Adding a new service or a new dependency invalidates the affected neighbourhoods and triggers a recompute. We made this acceptable by computing neighbourhoods lazily: invalidate on write, recompute on next read. Most service-registry changes happen during deploys, when correlation queries are quiet anyway, so the recompute cost is paid in a low-load window.
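Continuing the sketch above (same hypothetical names), lazy invalidation only has to drop the cached entries a topology write could affect; the recompute waits for the next read:

```python
class LazyNeighbourhoodIndex(NeighbourhoodIndex):
    """Invalidate on topology write, recompute on next correlation read."""

    def add_dependency(self, src, dst):
        self.adjacency.setdefault(src, set()).add(dst)
        # Conservatively drop every cached neighbourhood the new edge could touch.
        self.neighbourhoods = {
            key: hood for key, hood in self.neighbourhoods.items()
            if src not in hood and dst not in hood
        }

    def lookup(self, service, hops=3):
        key = (service, hops)
        if key not in self.neighbourhoods:      # invalidated, or never built
            self.neighbourhoods[key] = self._walk(service, hops)
        return self.neighbourhoods[key]
```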

We also collapsed the graph at the topology layer. Service dependencies that were transitively equivalent (A calls B, which only ever calls C) got compressed to a single edge with a hop count. The collapse cut the average node degree by 23% and made the neighbourhoods materially smaller. Less data to look up means faster lookups; less data to ship across the wire means faster round-trips.
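One way to express that collapse, as a sketch rather than our exact implementation: fold any pass-through service (exactly one caller, exactly one callee) into a single edge that carries the combined hop count. `collapse_chains` and the edge representation are hypothetical:

```python
def collapse_chains(edges):
    """edges maps (caller, callee) -> hop count (1 for a direct call).
    Repeatedly fold pass-through nodes into a single edge summing the hops."""
    collapsed = True
    while collapsed:
        collapsed = False
        callers, callees = {}, {}
        for (src, dst) in edges:
            callees.setdefault(src, []).append(dst)
            callers.setdefault(dst, []).append(src)
        for node in callers:
            if len(callers[node]) == 1 and len(callees.get(node, [])) == 1:
                pred, succ = callers[node][0], callees[node][0]
                if node in (pred, succ) or pred == succ:
                    continue                      # skip self-loops and 2-cycles
                hops = edges.pop((pred, node)) + edges.pop((node, succ))
                edges[(pred, succ)] = min(edges.get((pred, succ), hops), hops)
                collapsed = True
                break                             # rebuild the maps and repeat
    return edges
```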

The indexing change

The second change was the alert-grouping index. We were keying alert records by service + alert type. Under storm conditions, a single service firing the same alert type over and over produced a hot row that became the bottleneck: every correlation query had to take a row-level lock on it.

The new key is service + alert type + 30-second time bucket. The same alerts now spread across multiple rows; the hot row is gone; lock contention drops. The downside: we have more rows, and they expire faster (we GC the time-bucket rows after 30 minutes). The trade-off is the right one for this workload: read latency dominates write throughput by a wide margin, and the rows are tiny.
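The key construction is simple enough to show; this is a sketch with assumed names (`alert_group_key`, `BUCKET_SECONDS`), not the production schema:

```python
import time

BUCKET_SECONDS = 30          # width of the time bucket in the new key
RETENTION_SECONDS = 30 * 60  # bucketed rows are GC'd after 30 minutes

def alert_group_key(service, alert_type, now=None):
    """Old key: (service, alert_type) -> one hot row per storm.
    New key adds a 30-second bucket, so a storm of identical alerts spreads
    across a fresh row every 30 seconds instead of contending on one row."""
    now = time.time() if now is None else now
    bucket = int(now // BUCKET_SECONDS) * BUCKET_SECONDS
    return (service, alert_type, bucket)

def is_expired(bucket, now=None):
    """Time-bucket rows older than the retention window are eligible for GC."""
    now = time.time() if now is None else now
    return now - bucket > RETENTION_SECONDS
```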

The index lives in the same Postgres backend as the rest of the correlation state. We considered a separate datastore (Redis, Cassandra, a custom in-memory index) and rejected the idea because the operational complexity of a second datastore wasn't worth the tail-latency improvement we'd have gained. Boring choices win again.

Load-shedding under storm

The third change was load-shedding. When the correlation engine sees more than 500 signals per second sustained for 30 seconds, it switches into a degraded mode: it stops doing 3-hop neighbourhood lookups (uses 2-hop only), it stops doing optional dimension-aware clustering (uses service-aware only), and it batches signals in 5-second windows instead of processing each one immediately.
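In configuration terms, degraded mode is just a smaller set of knobs; the names here (`CorrelationSettings`, `NORMAL`, `LOAD_SHED`) are illustrative, not our actual config surface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrelationSettings:
    max_hops: int            # neighbourhood lookup depth
    dimension_aware: bool    # optional dimension-aware clustering on/off
    batch_window_s: float    # 0 means process each signal immediately

NORMAL = CorrelationSettings(max_hops=3, dimension_aware=True, batch_window_s=0.0)

# Load-shed mode trades cluster quality for responsiveness: 2-hop lookups only,
# service-aware clustering only, and 5-second batching.
LOAD_SHED = CorrelationSettings(max_hops=2, dimension_aware=False, batch_window_s=5.0)
```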

The degraded mode is announced: you see a banner on the incidents page that says "correlation engine running in load-shed mode." Cluster quality is lower in this mode (more false-separates, fewer false-joins), but the engine stays responsive. We'd rather have slightly lower-quality clustering at storm peak than a frozen engine producing nothing at all.

The mode auto-recovers when load drops below 200 signals per second for 60 seconds. The hysteresis prevents flapping; the recovery is invisible to the operator other than the banner disappearing. We've tested this against five real storm scenarios from beta tenants; in each case the engine stayed responsive and recovered within 90 seconds of the storm subsiding.
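The enter/exit logic fits in a few lines; here is a sketch of the hysteresis with the thresholds from above (the `LoadShedGate` name and structure are assumptions):

```python
import time

ENTER_RATE, ENTER_HOLD_S = 500, 30   # > 500 signals/s sustained for 30 s
EXIT_RATE, EXIT_HOLD_S = 200, 60     # < 200 signals/s sustained for 60 s

class LoadShedGate:
    """The enter and exit thresholds differ, and both must be sustained,
    so load hovering near one threshold cannot flap the engine between modes."""

    def __init__(self):
        self.shedding = False
        self._since = None   # when the current candidate condition started

    def observe(self, signals_per_second, now=None):
        now = time.time() if now is None else now
        if not self.shedding:
            crossing, hold = signals_per_second > ENTER_RATE, ENTER_HOLD_S
        else:
            crossing, hold = signals_per_second < EXIT_RATE, EXIT_HOLD_S

        if not crossing:
            self._since = None
        elif self._since is None:
            self._since = now
        elif now - self._since >= hold:
            self.shedding = not self.shedding
            self._since = None
        return self.shedding
```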

The actual numbers

Steady-state p95 went from 200ms to 142ms, a 29% improvement. Storm-load p95 went from 5.4 seconds to 1.8 seconds, a 67% improvement. The headline number we report is peak-load p95 down 38%, weighted across the workload mix we see in production. p99 improvements are larger than p95 in both regimes; tail-heavy queries benefited disproportionately from the indexing change.

What we didn't change matters too. The correlation algorithm itself is unchanged: same heuristics, same scoring, same output format. The API contract is unchanged; clients didn't have to update. In our experience, latency improvements at scale come from infrastructure changes far more often than from algorithm changes; this release confirmed that pattern again.

If you're looking at correlation latency in your own workloads, these are the lessons that transferred: precompute the neighbourhoods, key indexes by time bucket to spread hot rows, and define an explicit degraded mode for storm conditions. None of these is novel; doing them in the right order on the right metric is what produced the improvement.