Nova v2.7: AI Post-Mortems and Correlation-Engine Speedup
Auto-assembled post-mortems with human sign-off, plus a 38% p95 latency cut on the correlation engine. Two of the largest changes to the platform since v2.0 launched.
Why post-mortems were broken
Every team agrees post-mortems matter. Almost no team writes them on time. The median time-to-published-post-mortem in our customer base before v2.7 was 8 days. The 90th percentile was 31 days. By the time the document lands, the on-call rotation has moved on, the action items don't get assigned, and the lessons stay in someone's head.
The bottleneck isn't analysis; the engineers know what went wrong by the time they close the incident. The bottleneck is assembly: pulling timestamps from chat, screenshotting graphs, finding the relevant log lines, copying customer-impact numbers from the support team. It's two to four hours of mechanical work per incident, and it always falls on someone who's already tired from being on-call.
v2.7 ships the Postmortem agent, a specialised agent in the 100-agent fleet whose job is to assemble that draft from the data Nova already has, then route it for human sign-off. The agent doesn't replace the writer; it removes the assembly tax.
How the Postmortem agent works
The agent has read access to the incident timeline (every status update, every action taken, every chat message in the war room), the affected services and their golden signals during the window, the action ledger from the Remediate agent, and the customer-impact numbers from the support integration. That's the input.
The agent produces a structured draft with the standard sections: summary, timeline, impact, root cause, what went well, what didn't, action items. The summary is one paragraph that an executive can read in 90 seconds. The timeline is the actual incident timeline with explanatory prose between the events. The root cause section pulls from the diagnosis chain that the Diagnose agent built during the incident.
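As a rough sketch of that shape, a draft could be modelled as a record with one slot per standard section. The field names below are illustrative, not Nova's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical shape of the structured draft; field names are
# illustrative, not Nova's actual schema.
@dataclass
class PostmortemDraft:
    summary: str                                         # one executive-readable paragraph
    timeline: list[str] = field(default_factory=list)    # incident events with explanatory prose
    impact: str = ""                                     # customer-impact numbers from support
    root_cause: str = ""                                 # pulled from the diagnosis chain
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

def assemble_draft(incident: dict) -> PostmortemDraft:
    """Assemble the standard sections from data Nova already has."""
    return PostmortemDraft(
        summary=incident["summary"],
        timeline=incident["timeline"],
        impact=incident["impact"],
        root_cause=incident["diagnosis"],
    )
```

The point of the structure is that every section has a known home, so downstream review tooling can diff, link, and audit each one independently.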
The action items come from a different model entirely: a smaller fine-tuned classifier that reads the chat history and pulls out implicit commitments ("we should add a circuit breaker here," "we need to alert on this earlier"). False-positive rate is the thing we tuned hardest; better to miss an action item than to fabricate one.
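That precision-first tuning amounts to keeping a candidate only when the classifier is very confident. A minimal sketch, with a made-up threshold and made-up confidence scores:

```python
# Hypothetical precision-first filter: keep a candidate action item only
# when the classifier's confidence clears a high threshold, so the agent
# misses some real items rather than fabricating any. The threshold and
# scores below are illustrative, not Nova's actual values.
THRESHOLD = 0.9

def extract_action_items(candidates):
    """candidates: (text, confidence) pairs from the classifier."""
    return [text for text, score in candidates if score >= THRESHOLD]

candidates = [
    ("we should add a circuit breaker here", 0.96),
    ("we need to alert on this earlier", 0.93),
    ("maybe look at the dashboard sometime", 0.41),  # dropped: too uncertain
]
```

Raising the threshold trades recall for precision, which is the right trade when a fabricated action item costs the document its credibility.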
The whole assembly takes 30-90 seconds for a typical incident. The output lands in the post-mortem editor as a draft, with every fact linked back to its source: click any timestamp to see the underlying chat message; click any metric claim to see the graph it came from.
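One way to picture the click-through linking is that each claim in the draft carries a pointer to the record it came from. A minimal sketch, with hypothetical names:

```python
# Hypothetical provenance wrapper: every claim in the draft keeps a
# pointer back to its source record, so a reviewer's click resolves to
# the underlying chat message or graph. Names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcedFact:
    text: str         # the claim as it appears in the draft
    source_kind: str  # e.g. "chat", "metric", "ledger"
    source_id: str    # identifier to resolve on click-through

fact = SourcedFact(
    text="p95 latency spiked at 14:02 UTC",
    source_kind="metric",
    source_id="graph/4711",  # made-up id for illustration
)
```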
The human sign-off loop
The agent never publishes. Every draft requires human sign-off: the on-call commander reviews the document, edits whatever needs editing, and clicks Publish. The audit ledger records who signed off and when. This is non-negotiable; an AI-authored document published without a human on the byline is a credibility problem we don't want to introduce.
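The shape of that guarantee is a publish path that simply cannot run without a signer. A sketch, assuming a hypothetical ledger and draft shape:

```python
# Hypothetical publish gate: a draft cannot reach "published" without a
# human signer, and every sign-off is appended to an audit ledger.
# Function and field names are illustrative, not Nova's API.
import datetime

class SignoffRequired(Exception):
    pass

def publish(draft, signer=None, ledger=None):
    if signer is None:
        # agent drafts never publish on their own
        raise SignoffRequired("a human must sign off before publishing")
    ledger.append({
        "signer": signer,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    draft["status"] = "published"
    return draft
```

Making sign-off a hard precondition rather than a UI convention means the audit ledger and the publish action can never disagree.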
What's changed in our customer cohort: median time-to-published-post-mortem dropped from 8 days to 18 hours. 90th percentile from 31 days to 4 days. Edits-per-document is high (median 14 edits); the humans are doing real review, not rubber-stamping. That's the loop working.
The other thing we measured: the percentage of incidents that get a post-mortem at all. Before v2.7, customers post-mortemed roughly 40% of their P1 and P2 incidents. After v2.7, that's 89%. Most of the gap was tasks that just never got done; lowering the activation cost converts them into completed work.
Correlation-engine rebuild
The other big v2.7 change is under the hood. We rebuilt the correlation engine, the system that takes 200 raw alerts and turns them into a single incident with a single owner. The previous implementation was a streaming pipeline with a Postgres backing store; the bottleneck was the embedding lookup at scale.
The rewrite uses pgvector with HNSW indexing and an in-memory cache layer for the hot working set (alerts in the last 30 minutes). The topology graph is now a separate Neo4j instance kept in sync via change-data-capture from the service registry. Correlation runs as a join across the embedding index and the topology graph.
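The two-tier lookup described above can be sketched as: check an in-memory cache of the hot working set (alerts from the last 30 minutes) first, then fall back to the vector index for older neighbours. The class below is an assumption-laden illustration; the index query is a stub standing in for the pgvector/HNSW call:

```python
# Hypothetical two-tier lookup mirroring the described design: an
# in-memory cache for the hot working set (last 30 minutes of alerts),
# with a fallback to the vector index. Names are illustrative, and
# `index_query` is a stand-in for the real pgvector/HNSW query.
import time

HOT_WINDOW_S = 30 * 60  # the "last 30 minutes" hot working set

class CorrelationLookup:
    def __init__(self, index_query):
        self._cache = {}                  # alert_id -> (timestamp, embedding)
        self._index_query = index_query   # cold path: HNSW index stub

    def add(self, alert_id, embedding, now=None):
        self._cache[alert_id] = (now or time.time(), embedding)

    def candidates(self, embedding, now=None):
        now = now or time.time()
        # evict alerts that have aged out of the hot window
        self._cache = {k: v for k, v in self._cache.items()
                       if now - v[0] <= HOT_WINDOW_S}
        hot = list(self._cache)
        # cold path: consult the vector index for older neighbours
        return hot + self._index_query(embedding)
```

Keeping the hot set in memory means the common case never touches the database at all; only correlations that reach back past the window pay the index-query cost.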
p95 correlation latency dropped from 1.4s to 870ms, a 38% cut. p99 dropped harder, from 4.2s to 1.9s. The latency floor matters because correlation runs on every incoming alert; saving 500ms per alert at our customers' alert volumes saves entire CPU cores per tenant.
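A back-of-envelope version of the CPU claim, where only the 1.4s-to-870ms figures come from this post; the alert rate is a made-up example, and treating latency saved as CPU time saved only holds to the extent the work is CPU-bound:

```python
# Back-of-envelope for the "entire CPU cores" claim. The alert rate is
# an illustrative assumption; 1.4s -> 0.87s is the published p95 change.
alerts_per_second = 100            # hypothetical tenant alert volume
saved_per_alert_s = 1.4 - 0.87     # ~0.53 s saved per alert at p95
cpu_seconds_saved_per_second = alerts_per_second * saved_per_alert_s
# ~53 CPU-seconds of work removed per wall-clock second at this volume,
# i.e. on the order of tens of cores.
```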
What got measured
Two metrics tell the v2.7 story. Time-to-published-post-mortem dropped from 8 days median to 18 hours. Correlation p95 latency dropped 38%. Both numbers are from the customer cohort that ran the v2.7 RC for two weeks before GA.
Both features are live in production for all tenants as of today. No setup; the Postmortem agent activates the moment an incident closes; the correlation engine rebuild is transparent. If you've used Nova in the last 24 hours, you've already used both.
Next on the roadmap: extending the Postmortem agent to author RCAs in customer-facing language for status-page consumption, and pushing correlation p95 below 500ms by moving the embedding model inference closer to the database.