Product Updates · Beginner · By Samson Tanimawo, PhD · Published Sep 2, 2026 · 8 min read

Nova AI Ops Launch, March 2026

The launch story. Why we built an agentic SRE platform from scratch, what shipped on day one, and what we learned in our first six weeks with customers.

Why we built it

I spent years on-call with infrastructure teams operating at the scale where everything is on fire all the time. The pattern that always frustrated me: most incidents had a known answer, the answer was sitting in a runbook somewhere, and the engineer paged at 3am still spent 40 minutes finding the runbook, reading it, deciding it applied, and running the steps. The work wasn't hard; the friction was the work.

The traditional observability stack made the friction worse, not better. Metrics in one tool, logs in another, traces in a third, runbooks in a wiki nobody trusts, alerts going to a Slack channel everyone's muted. An incident response that should take five minutes took an hour because the engineer had to manually stitch the picture together from five sources.

The thesis behind Nova: the LLM revolution finally makes it possible to do that stitching automatically. Not just summarisation, but actual decision-making against a known runbook library, with safety rails and a human-approval surface for the things that matter. Not "let the AI do incident response unsupervised"; that's reckless. "Let the AI do the obvious 80% under supervision and surface the rest to the human in seconds, not hours." That's the bet.

What shipped on day one

March 21, 2026 was the public launch date. The product that shipped had four agents, Diagnose, Remediate, Audit, and Learn, running against a single platform. Diagnose ingests metrics, logs, and traces; runs correlation; and surfaces a hypothesis. Remediate matches the hypothesis to a known runbook and proposes the next action. Audit writes a ledger entry for every action so the post-mortem trail is automatic. Learn updates the runbook library based on what worked and what didn't.

Underneath the agents: a unified data layer for metrics, logs, and traces. Auto-instrumentation for Kubernetes, Linux hosts, and the eight most common application runtimes. Integrations for Slack, PagerDuty, GitHub, Linear, and Jira. A dashboard surface for service health, an incidents page for active and historical incidents, and a runbook library shared across the tenant.

The pricing was, and is, usage-based with a Basic tier. Basic tier is enough to instrument 5 services and run the agents end-to-end. Paid plans scale by service count and signal volume. We made the Basic tier deliberately generous because the product gets meaningfully better the more services you connect; the network effect is internal to a tenant, not across them.

The first six weeks

Weeks 1-2 were the rush. The launch landed on Hacker News and a handful of newsletters; sign-ups peaked at about 800 a day in week 1, then settled to a steady ~150 a day by the end of week 2. About 18% of sign-ups instrumented at least one service in their first session. The biggest drop-off in onboarding was at "register your first service": too many fields, too much to read. We cut the form to two fields by the end of week 3.

Weeks 3-4 were the iteration. We watched session replays of engineers actually using the product (with consent), found the rough edges, and fixed the obvious ones daily. The single highest-impact fix was making the runbook library searchable from the incident page; engineers were tab-switching to find runbooks, and making search inline cut incident-page session length by 40%.

Weeks 5-6 were the first wave of paid conversions. The teams that converted shared a profile: small platform team (5-15 engineers), already running on Kubernetes, already had a basic observability stack but were tired of stitching it together. Nova replaced 2-4 tools in their stack and gave them a usable on-call experience for the first time. The pitch that worked best wasn't "we're better than Datadog", it was "we're the agent layer your existing stack doesn't have."

What we learned

Three lessons from the first six weeks. First, the agent ledger was more important than we realised. We built it as an audit trail; engineers used it as a way to build trust in the agents. "Show me what Nova actually did" was the most common request from engineers evaluating the product. The ledger is the answer; we've since promoted it from a settings tab to a top-level surface.
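A ledger that builds trust is essentially an append-only action trail. Below is a minimal sketch of one in Python; the hash-chaining (each entry committing to the previous one so tampering is detectable) is my own assumption about how such a ledger might be hardened, not a documented Nova feature, and all field names are illustrative.

```python
import hashlib
import json
import time

class AgentLedger:
    """Append-only trail of agent actions. Each entry records a hash of the
    previous entry, making the trail tamper-evident (an assumed design)."""

    def __init__(self):
        self.entries = []

    def record(self, agent: str, action: str, detail: dict) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {
            "ts": time.time(),        # when the agent acted
            "agent": agent,           # which agent (diagnose, remediate, ...)
            "action": action,         # what it did
            "detail": detail,         # structured payload for the post-mortem
            "prev": prev,             # hash chain back to the genesis entry
        }
        # Hash the entry body (excluding its own hash) deterministically.
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)
        return body

ledger = AgentLedger()
ledger.record("diagnose", "hypothesis", {"text": "elevated error rate"})
ledger.record("remediate", "proposal", {"runbook": "rollback"})
```

"Show me what Nova actually did" then reduces to a read over `ledger.entries`, which is exactly why promoting it to a top-level surface is cheap.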

Second, runbooks are the bottleneck. The agents are only as good as the runbook library; tenants with 0-5 runbooks got limited value, while tenants with 30+ runbooks got transformative value. We've since built a runbook authoring tool that suggests runbook drafts based on resolved incidents, lowering the cost of building the library.
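The drafting idea can be sketched in a few lines: walk the action trail of a resolved incident and keep the remediation steps that worked as a draft for a human to review. The dict shapes, field names, and heuristic below are all hypothetical, chosen only to make the idea concrete.

```python
def draft_runbook(resolved_incident: dict) -> dict:
    """Turn a resolved incident's action trail into a runbook draft
    (placeholder heuristic; a human reviews before it enters the library)."""
    steps = [
        entry["action"]
        for entry in resolved_incident["ledger"]
        if entry["agent"] == "remediate" and entry.get("worked")
    ]
    return {
        "title": f"Runbook draft: {resolved_incident['hypothesis']}",
        "trigger": resolved_incident["hypothesis"],
        "steps": steps,
        "status": "draft",   # never auto-published to the library
    }

# Illustrative resolved incident with a minimal ledger.
incident = {
    "hypothesis": "elevated error rate after deploy",
    "ledger": [
        {"agent": "diagnose", "action": "correlate signals", "worked": True},
        {"agent": "remediate", "action": "roll back release", "worked": True},
    ],
}
print(draft_runbook(incident)["steps"])   # ['roll back release']
```

Keeping `status: "draft"` hard-coded reflects the lesson in the next paragraph: the human stays in charge, so drafts never enter the library without review.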

Third, the on-call human still needs to be in charge. Every attempt we made to push toward "let Nova handle this without your involvement" was met with reasonable resistance. Engineers don't want fewer pages; they want faster context when they get one. The product reframe in week 4, "Nova reduces time-to-context, not page count", was the messaging that landed best in our calls and the one we still use.

What's next

The Q3 roadmap is published separately and covers the work in detail. The summary version: deeper auto-remediation (more of the obvious 80% handled automatically), multi-region failover (the feature most large-tenant prospects ask about), and an agent SDK (so platform teams can build their own agents on Nova's plumbing).

The harder work for the rest of the year is ecosystem. We have integrations for the obvious tools; we don't yet have integrations for half the niche tools real teams run. We're prioritising the integrations our paying customers actually need over the generic "every observability tool" list. If you're using something we don't have, the support inbox is the right place to file the request, and we read every one.

If you're new here and reading this because you saw the launch land, the easiest next step is the Basic tier. Connect a service, watch the agents work for a week, decide whether the time-to-context improvement is real for your team. The pitch is simple: the friction in incident response is the work, and the work is what Nova is built to remove.