Inside Nova's Incident Correlation Engine
How 200 raw alerts become a single incident with a single owner. Graph-based correlation, embedding similarity, topology-aware grouping, and the 38% p95 speedup we shipped in v2.7.
The correlation problem
A single slow query on one database becomes 200 alerts in seconds. The database alerts on slow queries. The API alerts on slow upstream calls. The frontend alerts on slow page loads. The customer-impact monitor alerts on the elevated bounce rate. Synthetic monitors in three regions alert on failed checkouts. The on-call dashboard lights up red.
The on-call engineer's job is figuring out that all 200 alerts are one incident with one owner. Tools that don't correlate force the engineer to do that triage manually under time pressure. Tools that correlate badly group unrelated alerts together (false positive) or split related alerts apart (false negative); both are costly.
Nova's correlation engine is the layer that takes raw alerts in and emits structured incidents out. The next sections cover what it actually does.
Three signals, one decision
The engine combines three signals to decide whether two alerts belong to the same incident. (1) Topological proximity: are the alerts on services that share a known dependency edge? (2) Embedding similarity: does the alert text describe related symptoms? (3) Temporal proximity: did the alerts fire within the same time window?
No single signal is sufficient. Topology alone correlates everything that touches the database (too coarse). Embeddings alone group every "high latency" alert across unrelated services (too coarse the other way). Time alone is purely coincidental: many things happen at the same time without being related.
The combined score is a weighted product of the three signals. The weights are learned from labelled training data: pairs of alerts that the customer's on-call engineers grouped or split during real incidents. The model retrains weekly per tenant; the weights drift as the customer's stack and service map evolve.
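To make the combination concrete, here is a minimal sketch of one plausible reading of "weighted product": each signal, normalised to [0, 1], is raised to a per-tenant learned exponent and the results are multiplied. The function name, weight layout, and example values are illustrative, not Nova's actual implementation.

```python
def combined_score(topo: float, emb: float, temporal: float,
                   w_topo: float, w_emb: float, w_time: float) -> float:
    """Weighted product of the three correlation signals. Each signal is
    assumed to lie in [0, 1]; the learned exponents control how strongly
    each signal can pull the pair score down."""
    return (topo ** w_topo) * (emb ** w_emb) * (temporal ** w_time)

# Example: strong topology and text agreement, moderate temporal proximity.
score = combined_score(topo=0.9, emb=0.85, temporal=0.6,
                       w_topo=1.2, w_emb=1.0, w_time=0.7)
print(f"pair score: {score:.3f}")  # compared against a tuned grouping threshold
```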
The topology graph
The topology graph is the customer's service map: what calls what, what reads from what, what writes to what. We build it from three sources: distributed traces (the OpenTelemetry trace context shows callers and callees), service mesh telemetry (Istio and Linkerd export the call graph directly), and infrastructure-as-code parsing (Terraform and Kubernetes manifests describe the static dependency graph).
The graph is stored in Neo4j with an edge for each dependency type: calls, reads, writes, deploys-to. Edges are weighted by traffic volume; a service that rarely calls another gets a low-weight edge, which in turn lowers the topological-proximity score for alerts on those two services.
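As an illustration of how a direct-edge lookup against that graph might look, here is a sketch using the Neo4j Python driver. The connection details, node label, relationship types, and weight property are assumptions made for the example, not Nova's real schema.

```python
from neo4j import GraphDatabase

# Hypothetical connection details; replace with real credentials and host.
driver = GraphDatabase.driver("bolt://topology-db:7687",
                              auth=("neo4j", "example-password"))

# Assumed schema: Service nodes plus CALLS/READS/WRITES/DEPLOYS_TO edges
# carrying a traffic-volume weight.
EDGE_QUERY = """
MATCH (a:Service {name: $svc_a})-[e:CALLS|READS|WRITES|DEPLOYS_TO]-(b:Service {name: $svc_b})
RETURN type(e) AS kind, e.weight AS weight
"""

def topo_proximity(svc_a: str, svc_b: str) -> float:
    """Score the direct dependency between two services: the heaviest
    edge linking them, or 0.0 when no edge exists."""
    with driver.session() as session:
        rows = session.run(EDGE_QUERY, svc_a=svc_a, svc_b=svc_b)
        weights = [r["weight"] for r in rows if r["weight"] is not None]
    return max(weights, default=0.0)
```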
The graph is kept current via change-data-capture from the source-of-truth telemetry. New service rolled out at 9am? It's in the graph by 9:01. Old service decommissioned? Its node is marked stale within the hour. Stale nodes age out after 72 hours of no signal.
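For the age-out step, a sweep like the following would do the job; the Service label and last_seen property are assumptions, and the real pipeline is driven by change-data-capture rather than a periodic job.

```python
# Hypothetical staleness sweep: drop service nodes with no signal for 72 hours.
STALE_SWEEP = """
MATCH (s:Service)
WHERE s.last_seen < datetime() - duration('PT72H')
DETACH DELETE s
"""

def sweep_stale_nodes(session) -> None:
    """Remove nodes whose telemetry went quiet more than 72 hours ago,
    along with any edges still attached to them."""
    session.run(STALE_SWEEP)
```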
Embedding similarity
Each alert is embedded with a fine-tuned sentence-transformer model trained on our incident corpus. Generic embeddings (OpenAI ada-002, sentence-transformers/all-MiniLM) underperform on SRE-specific text by ~14% in retrieval quality benchmarks. Training a domain-specific model was the highest-ROI ML investment we made.
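For readers who haven't used sentence-transformers, embedding and comparing two alert texts looks roughly like this; the model id is a hypothetical stand-in for the internal fine-tuned checkpoint, and the alert strings are invented.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical model id; the fine-tuned checkpoint itself is not public.
model = SentenceTransformer("nova/sre-alert-embedder")

alerts = [
    "p99 latency on orders-db above 2s for SELECT-heavy queries",
    "checkout-api upstream calls to orders-db timing out after 2000ms",
]
embeddings = model.encode(alerts, normalize_embeddings=True)

# Cosine similarity of the two alert texts; related symptoms score high.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```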
The embeddings are stored in pgvector with HNSW indexing for nearest-neighbour search. When a new alert arrives, the engine finds the K nearest existing-alert embeddings within the last 30 minutes and computes pairwise similarity. Two alerts with high embedding similarity probably describe related symptoms; that's a strong signal but not sufficient on its own.
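A sketch of what that windowed nearest-neighbour lookup could look like, assuming a hypothetical alert_embeddings table with embedding and fired_at columns; <=> is pgvector's cosine-distance operator, which the HNSW index accelerates.

```python
import psycopg

# Hypothetical table and column names; the production schema may differ.
NEIGHBOUR_SQL = """
SELECT alert_id,
       embedding <=> %(query_vec)s::vector AS cosine_distance
FROM alert_embeddings
WHERE fired_at >= now() - interval '30 minutes'
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s
"""

def nearest_alerts(conn: psycopg.Connection, query_embedding: list[float], k: int = 20):
    """Return the K nearest existing-alert embeddings from the last 30
    minutes, ranked by cosine distance to the new alert's embedding."""
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(NEIGHBOUR_SQL, {"query_vec": vec_literal, "k": k})
        return cur.fetchall()
```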
The fine-tune training set is 240k labelled alert pairs from our customer corpus (with consent, anonymised). The model is small (110M params), so inference is fast: embedding a single alert takes under 12ms on the GPU pool we run for this purpose.
Temporal proximity
Temporal proximity is the cheapest signal but the hardest to use well. Two alerts firing in the same second are suggestive; two firing five minutes apart, less so; two firing thirty minutes apart are almost certainly different incidents.
The engine uses a decaying exponential: same-second pairs get weight 1.0, 30-second-apart pairs 0.6, 5-minute-apart pairs 0.2, and 30-minute-apart pairs near zero. The decay rate is learned per tenant; teams with cascading-failure-prone architectures have slower decays (more correlation across longer windows) than teams with isolated services.
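A minimal sketch of the decay, assuming a plain exp(-gap/τ) form with a per-tenant τ; the example weights quoted above are illustrative, so no single τ reproduces all of them exactly.

```python
import math

def temporal_proximity(claimed_ts_a: float, claimed_ts_b: float,
                       tau_seconds: float) -> float:
    """Decaying-exponential weight on the gap between two alerts' claimed
    timestamps (not arrival timestamps). tau_seconds is the per-tenant
    learned decay constant."""
    gap = abs(claimed_ts_a - claimed_ts_b)
    return math.exp(-gap / tau_seconds)

# A tenant prone to cascading failures would learn a larger tau (slower decay).
print(temporal_proximity(0, 30, tau_seconds=60))    # ~0.61
print(temporal_proximity(0, 300, tau_seconds=60))   # ~0.007
```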
One detail that matters in practice: alerts sometimes arrive out of order due to alert-source clock skew or delivery delays. The engine uses the alert's claimed timestamp, not its arrival timestamp. This is a small thing that matters during a high-fire-rate incident.
Why v2.7 got 38% faster
The v2.7 rebuild cut p95 correlation latency from 1.4s to 870ms and p99 from 4.2s to 1.9s. The changes were boring: none of them is novel, and all of them mattered.
(1) Hot-path embedding cache. The previous version recomputed embeddings on every correlation pass. The rewrite caches embeddings per alert in an in-memory LRU keyed by alert ID; cache hit rate is 92%. (2) HNSW replaces brute-force search. Nearest-neighbour search dropped from O(n) to O(log n); the practical effect at our customers' scale is 8-12x faster lookup. (3) Query batching. The previous version did one DB round-trip per topology-graph edge lookup; the rewrite batches edges into a single Cypher query.
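As a flavour of change (1), here is a self-contained sketch of an in-memory LRU keyed by alert ID; the cache size and the embed callable are placeholders rather than Nova's production configuration.

```python
from collections import OrderedDict
from typing import Callable, Sequence

class EmbeddingLRU:
    """Cache embeddings per alert so repeated correlation passes over the
    same alert never recompute them."""

    def __init__(self, embed: Callable[[str], Sequence[float]],
                 max_entries: int = 50_000):
        self._embed = embed
        self._max = max_entries
        self._cache: OrderedDict[str, Sequence[float]] = OrderedDict()

    def get(self, alert_id: str, alert_text: str) -> Sequence[float]:
        if alert_id in self._cache:
            self._cache.move_to_end(alert_id)   # mark as recently used
            return self._cache[alert_id]
        vec = self._embed(alert_text)           # cache miss: hit the model
        self._cache[alert_id] = vec
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)     # evict least recently used
        return vec
```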
The combined effect is the 38% p95 cut. None of the individual changes is interesting on its own; they're the kind of profile-led optimisation any reasonable team would do. We did them because the latency floor matters: saving 500ms per alert at our customers' alert volumes frees up entire CPU cores per tenant, and the cost compounds across the whole fleet.