Nova v2.0: The Platform Launch

Today we launched Nova AI Ops as a full platform. 100 specialised agents working across diagnosis, remediation, audit, and learning. AI-drafted post-mortems. Unified observability and incident management on the same data plane. Here's what shipped and why.

Why we built a platform

The AIOps category split into a half-dozen point products over the last decade: alert correlation, anomaly detection, runbook automation, post-mortem authoring, observability, incident management. Each is a tool. Each tool has its own data model, its own access controls, its own onboarding. The on-call engineer running through six tabs during a P1 is the failure mode.

v2.0 is the bet that the right shape is one platform with one data model, one identity layer, one audit trail, and one event bus, with the specialised functionality (correlation, remediation, post-mortems) built as agents on top. Not a suite. Not a bundle. A platform.

That's a long-cycle bet. We've been building it for two years. Today is the public launch.

The 100-agent fleet

Nova ships with 100 specialised agents organised across 12 functional teams: Diagnose, Remediate, Detect, Audit, Learn, Communicate, Plan, Verify, Investigate, Score, Predict, Document. Each agent has a narrow scope: the DB Latency Diagnose agent does one thing and does it well. Compose them and you get end-to-end incident response.

The agents share a common substrate: a tool palette of 80+ vetted operations (kubectl, AWS CLI, Datadog query, Slack message, etc.), a shared memory store keyed by tenant and incident, and a shared evaluation harness so we can run regression tests across the whole fleet on every change.
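To make "keyed by tenant and incident" concrete, here is a minimal in-memory sketch. It is an illustration only, not Nova's actual store: the names `MemoryKey` and `SharedMemory` and the `put`/`get` methods are hypothetical, and a real store would be persistent and access-controlled.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MemoryKey:
    """Hypothetical composite key: every entry belongs to one tenant and one incident."""
    tenant_id: str
    incident_id: str


class SharedMemory:
    """Toy in-memory stand-in for a shared agent memory store (assumption, not Nova's API)."""

    def __init__(self) -> None:
        self._store: dict[MemoryKey, dict[str, Any]] = {}

    def put(self, key: MemoryKey, name: str, value: Any) -> None:
        # Create the per-incident namespace on first write.
        self._store.setdefault(key, {})[name] = value

    def get(self, key: MemoryKey, name: str, default: Optional[Any] = None) -> Any:
        # Reads never create entries and never cross tenant or incident boundaries.
        return self._store.get(key, {}).get(name, default)


memory = SharedMemory()
acme_incident = MemoryKey(tenant_id="acme", incident_id="INC-1")
memory.put(acme_incident, "suspect", "db-primary")
```

Because the tenant is part of the key, a lookup for the same incident ID under a different tenant returns nothing, which is the isolation property the keying is meant to give.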

The agents are not chained linearly. The runtime is a graph: an open incident triggers the relevant agents based on the signals, agents emit events that wake up other agents, and the Audit agent records every step. The graph topology is the platform's interesting bit; the individual agents are mostly small.
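The event-driven shape described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Nova runtime: the `Runtime` class, the event names, and the three toy agents are all hypothetical, and the audit step is modelled as a plain list.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    kind: str
    payload: str


Handler = Callable[[Event], list]


class Runtime:
    """Toy event graph: agents subscribe to event kinds and may emit new events."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[tuple[str, Handler]]] = {}
        self.audit_log: list[tuple[str, str, list[str]]] = []

    def subscribe(self, kind: str, agent_name: str, handler: Handler) -> None:
        self._subscribers.setdefault(kind, []).append((agent_name, handler))

    def run(self, first_event: Event, max_steps: int = 100) -> list:
        queue = deque([first_event])
        steps = 0
        # max_steps guards against cycles in the graph looping forever.
        while queue and steps < max_steps:
            event = queue.popleft()
            for agent_name, handler in self._subscribers.get(event.kind, []):
                emitted = handler(event)
                # Audit step: record who ran, on what, and what they emitted.
                self.audit_log.append((agent_name, event.kind, [e.kind for e in emitted]))
                queue.extend(emitted)
                steps += 1
        return self.audit_log


def diagnose(event: Event) -> list:
    return [Event("diagnosis.ready", f"suspect for: {event.payload}")]


def remediate(event: Event) -> list:
    return [Event("remediation.done", "restarted connection pool")]


def communicate(event: Event) -> list:
    return []  # posts an update, emits nothing further


runtime = Runtime()
runtime.subscribe("incident.opened", "diagnose", diagnose)
runtime.subscribe("diagnosis.ready", "remediate", remediate)
runtime.subscribe("diagnosis.ready", "communicate", communicate)
log = runtime.run(Event("incident.opened", "db latency"))
```

Note that one `diagnosis.ready` event wakes two agents at once; that fan-out is what separates a graph from a linear chain.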

Unified observability and IM

Most tools force you to pick a side. Datadog and New Relic are observability with bolted-on incident workflow. PagerDuty and FireHydrant are incident management with bolted-on metric pulls. Each half is diluted when bolted onto the wrong primary.

Nova was built with metrics, logs, traces, and incidents on the same data plane from day one. The same query language hits all four. The same dashboards can pin metric panels next to incident timelines. The same access controls govern who can see what across all data types.

The practical effect: during an incident, the on-call engineer doesn't context-switch between tools to correlate a metric anomaly with a log spike and with the customer-facing incident. The signals are already linked at the data layer. The Diagnose agent can reach across all of them in a single query.
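As a rough illustration of what "linked at the data layer" buys you, the sketch below puts all four signal types in one record shape so a single filter spans them. It does not show Nova's query language; the `Signal` type, the `query` function, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str      # "metric", "log", "trace", or "incident"
    service: str
    ts: int        # seconds since some epoch
    detail: str


def query(signals: list, service: str, start: int, end: int) -> list:
    """Return every signal for one service in a time window, oldest first."""
    matches = [s for s in signals if s.service == service and start <= s.ts <= end]
    return sorted(matches, key=lambda s: s.ts)


signals = [
    Signal("metric", "checkout", 50, "p99 latency normal"),
    Signal("metric", "checkout", 120, "p99 latency anomaly"),
    Signal("log", "checkout", 125, "connection pool exhausted"),
    Signal("log", "search", 126, "cache miss"),
    Signal("trace", "checkout", 130, "slow span in db call"),
    Signal("incident", "checkout", 140, "customer-facing checkout errors"),
]

window = query(signals, service="checkout", start=100, end=200)
```

With separate tools, the same correlation would mean three or four queries in different languages, joined by hand on service name and timestamp.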

AI Post-Mortems

v2.0 includes the first version of AI Post-Mortems. The Postmortem agent assembles a draft from the incident timeline, the actions taken, the customer impact, and the chat history. The draft lands in the post-mortem editor with every fact linked back to its source. A human reviews, edits, signs off, and publishes.
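The "every fact linked back to its source" property can be sketched as follows. This is a minimal illustration, not the Postmortem agent itself: the `Fact` type, the `assemble_draft` function, the source ID format, and the sample incident are hypothetical, and the drafting step that a model would perform is replaced by ready-made facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    text: str
    source: str  # hypothetical reference format, e.g. "timeline:3" or "chat:msg-17"


def assemble_draft(title: str, facts_by_section: dict) -> str:
    """Render sections of facts as plain text, each bullet carrying its source reference."""
    lines = [title, ""]
    for section, facts in facts_by_section.items():
        lines.append(section)
        for fact in facts:
            lines.append(f"- {fact.text} [source: {fact.source}]")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


draft = assemble_draft(
    "Post-mortem draft: checkout latency",
    {
        "Timeline": [Fact("p99 latency anomaly detected", "timeline:1")],
        "Actions taken": [Fact("connection pool restarted", "timeline:3")],
        "Customer impact": [Fact("checkout errors reported by support", "chat:msg-17")],
    },
)
```

Keeping the source on the fact itself, rather than in a separate appendix, means a reviewer can check or reject each claim individually before signing off.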

This is the feature that closed the loop. Before AI Post-Mortems, "incident closes" meant the on-call commander had a tired person's queue of follow-up writing to do tomorrow. After AI Post-Mortems, "incident closes" means a draft is ready for review tomorrow. Median time-to-published moved from 8 days to 18 hours among customers running the v2.0 RC.

v2.7 will refine the agent further (faster, more accurate root-cause inference, better action-item extraction), but the v2.0 version is good enough to ship.

What we built it on

The infrastructure is unfashionable on purpose. Postgres for transactional data, ClickHouse for telemetry, Neo4j for the topology graph, pgvector for embeddings, S3-compatible blob storage for artifacts. Kafka for the event bus. Kubernetes for everything compute. No exotic primary stores; no proprietary databases; no "we built our own."

The reason is operational. We're an SRE platform; we run our own infrastructure as if our customers' incidents depend on it (they do). Boring tech is what you want when you're the SRE for an SRE platform.

Where we did build custom: the agent runtime, the topology-aware correlation engine, the post-mortem assembler, and the streaming-encryption layer for Nova Transfer. Those are the pieces where the off-the-shelf options didn't fit; everywhere else, we used what was already proven.

What's next

Three big themes for the next six months. (1) Pushing more incidents to fully automated closure: the agents close roughly 30% of low-risk incidents today; we want 60%. (2) Predictive detection: agents that flag the regression before it becomes the incident. (3) Single-tenant and BYOK deployment for the regulated-industry buyers.

v2.0 is live for new and existing customers as of today. Existing customers have been auto-migrated; nothing to do. New customers can sign up at app.novaaiops.com. The platform is the product; everything else is detail.

This took a lot of people a long time. Thanks to the team and to the customers who ran the RC and told us what was broken. The bigger version is ahead.