FEATURES · ALL INCLUDED, EVERY PLAN

Everything your SRE team needs, powered by AI.

Nova AI Ops is a single AI-native platform that replaces the tangled stack most teams run today: separate tools for monitoring, alerting, on-call, dashboards, logs, and incident response, each with its own bill, login, and learning curve. It is built for SRE, DevOps, and platform engineering teams who refuse to accept 3am pages as normal. Every feature on this page is included in every Nova AI Ops plan: no hidden modules, no per-host pricing, no professional services required.

At the core of Nova AI Ops is a fleet of 100 autonomous AI agents organized across multiple specialist domains. These agents continuously ingest metrics, logs, traces, and events from your entire infrastructure, correlate signals across systems, and execute remediation runbooks with confidence scoring. When an incident requires human judgment, Nova pages the right engineer with a pre-built summary containing the root cause, impacted services, suggested fix, and every relevant metric. When an incident can be safely auto-remediated, Nova does it in seconds, before your customers notice.

AI and Automation

The AI layer is what makes Nova different from every AIOps tool built before 2024. We don't use statistical anomaly detection dressed up as AI. We deploy actual autonomous agents that read runbooks, execute diagnostic commands, and make remediation decisions with confidence scoring. Agents own specific categories (pod lifecycle, network policies, certificate management, database health) and coordinate when incidents span multiple domains. Every agent decision is logged with its reasoning for compliance and explainability.

100 AI Agents

Specialist agent domains working 24/7. From incident response to security and compliance, each agent is a domain expert.

Core Response, Infrastructure, Cloud, DevOps teams
Real-time trust scores and performance tracking
Full audit trail of every AI decision

AI Runbooks

Auto-generated executable runbooks that learn from your incident history.

AI-generated runbooks for every severity level
What-if simulation before real incidents
One-click execution with approval workflows

Auto-Remediation

AI agents that execute fixes autonomously. Rollbacks, scaling, restarts, and config changes.

Configurable autonomy levels
Approval queue for high-risk actions
Complete rollback capability
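
To make the idea of configurable autonomy levels and an approval queue concrete, here is a minimal illustrative sketch. It is not Nova's implementation: the level names, the risk labels, and the `route` function are all assumptions invented for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    """Hypothetical autonomy levels, from most to least restrictive."""
    SUGGEST_ONLY = 0   # never execute, only recommend
    LOW_RISK_AUTO = 1  # execute low-risk actions, queue the rest
    FULL_AUTO = 2      # execute everything except high-risk actions


@dataclass(frozen=True)
class Action:
    name: str
    risk: str  # "low", "medium", or "high" (assumed labels)


def route(action: Action, level: Autonomy) -> str:
    """Return "suggest", "execute", or "queue_for_approval"."""
    if level == Autonomy.SUGGEST_ONLY:
        return "suggest"
    if action.risk == "high":
        # High-risk actions always wait for a human in this sketch.
        return "queue_for_approval"
    if action.risk == "low" or level == Autonomy.FULL_AUTO:
        return "execute"
    return "queue_for_approval"
```

Under these assumptions, a medium-risk scale-up runs on its own at `FULL_AUTO` but lands in the approval queue at `LOW_RISK_AUTO`.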

Observability

One unified view across metrics, logs, traces, and events. No more jumping between Datadog dashboards, Splunk queries, Grafana panels, and Jaeger waterfalls to piece together what happened. Nova ingests from every major observability source, correlates across signal types, and surfaces the single timeline of cause and effect that your engineers actually need during an incident. Cardinality-independent pricing means you can tag everything without watching your monthly bill explode.

Golden Signals Dashboard

The four golden signals of SRE monitoring in one view: latency, traffic, errors, and saturation.

Real-time P50, P95, P99 percentiles
Traffic throughput and error rate tracking
Instant comparison against 24h baselines
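
For readers unfamiliar with latency percentiles, the sketch below shows one common way to compute P50, P95, and P99 from raw samples (the nearest-rank method). It is an illustration only: the function name and sample data are assumptions, and the source does not say which percentile method Nova uses.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 900, 17]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
# summary == {"p50": 15, "p95": 900, "p99": 900}
```

The median stays at 15 ms while the tail percentiles expose the outliers, which is why dashboards track P95 and P99 alongside P50.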

Real-time Dashboard

A single pane of glass for your entire infrastructure. Live metrics with zero query lag.

Unified view across all clouds and services
Sub-second metric refresh
Custom layouts with drag-and-drop Studio

Log Explorer

Search billions of log lines in milliseconds with anomaly detection and pattern analysis.

Full-text search across all sources
Automatic correlation with traces and metrics
Smart log pattern detection

Alert Management

1000 prebuilt alert rules tuned for the SRE problems teams actually hit. Toggle on what fits your stack, override thresholds per environment, skip the blank rule editor entirely.

Day-one coverage: k8s, Linux, AWS, GCP, Azure, Postgres, Redis, Kafka, NGINX
Each rule ships with a default threshold, runbook, and responding agent
Override anything per environment without forking the rule
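
One way to picture per-environment overrides without forking a rule is a single rule object that carries its default plus a map of overrides. The sketch below is hypothetical: the field names, the rule ID, and the runbook path are invented for illustration and are not Nova's rule schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlertRule:
    rule_id: str
    metric: str
    default_threshold: float
    runbook: str
    responding_agent: str
    overrides: dict = field(default_factory=dict)  # environment -> threshold

    def threshold_for(self, env: str) -> float:
        """Use the environment override if one exists, else the shipped default."""
        return self.overrides.get(env, self.default_threshold)

    def fires(self, env: str, value: float) -> bool:
        return value > self.threshold_for(env)


# Hypothetical rule: one definition, with staging allowed to run hotter than prod.
rule = AlertRule(
    rule_id="pg-connection-saturation",
    metric="postgres.connections.used_pct",
    default_threshold=80.0,
    runbook="runbooks/pg-connections",
    responding_agent="database-health",
    overrides={"staging": 95.0},
)
```

Here a reading of 85% fires in prod but not in staging, and the rule itself is never copied.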

Browse the rule library →

Incident Management

Full incident lifecycle management (detection, triage, escalation, remediation, and post-mortem) in one platform. No PagerDuty subscription on top of Nova, no separate incident.io bill, no Rootly integration to maintain. Nova includes on-call scheduling with timezone awareness, escalation policies with acknowledgment timeouts, war room creation with Slack and Zoom bridges, and automated post-mortem generation that writes the first draft for you based on the incident timeline and every action taken.

Intelligent Incident Management

AI-powered detection, smart severity scoring, blast radius analysis, and automated lifecycle management.

Cut MTTR from 47 minutes to under 2 minutes
80% less alert noise with AI deduplication
Auto-generated post-mortems

On-Call Management

Intelligent scheduling that respects time zones, workload, and fatigue levels.

Automated rotation with fairness balancing
Smart escalation based on skill match
Multi-channel notifications

AI Post-Mortems

Automatically generated incident reports with timeline reconstruction and root cause analysis.

Auto-generated timeline from signals
Actionable recommendations ranked by impact
Blameless format following SRE best practices

Monitoring and Testing

Synthetic checks from 30+ global regions, real user monitoring (RUM) with Core Web Vitals, uptime monitoring with per-second granularity, and chaos engineering primitives for proactive resilience testing. Nova doesn't just tell you when something is down. It warns you before it goes down by learning your baseline over 14 days and flagging drift in latency, error rates, and saturation. Chaos experiments run on schedules you control so you can validate runbooks before you need them.

Synthetic Monitoring

Proactively test your APIs, websites, and critical user flows from 20+ global locations.

Multi-step user flow testing
SSL certificate expiry monitoring
Response time degradation detection

Predictive Detection

ML models trained on your infrastructure patterns detect anomalies before they become incidents.

Pattern recognition from historical data
Capacity forecasting and trend extrapolation
Early warning system with confidence scores

Performance Trends

Track infrastructure performance over weeks and months. Spot degradation trends early.

Long-term trend analysis
Capacity planning with growth projections
Cost optimization recommendations

Platform

Enterprise-grade infrastructure underneath every feature: SOC 2 Type II audit in progress, with controls aligned to the HIPAA Security Rule and PCI-DSS (BAA available on request, formal attestation on the roadmap), SAML SSO with any IdP, SCIM provisioning, role-based access control down to the resource level, and full audit logging with 7-year retention for regulated industries. The platform also offers blue/green deployments for zero-downtime rollouts, a 99.9% uptime SLA target for enterprise tiers, private cloud options for customers who require data residency, and a public API that every UI action is built on. Anything you see in the Nova UI, you can automate.

Service Catalog

Real-time health, dependency mapping, and SLO compliance across every service.

Interactive dependency topology map
SLO tracking with error budget burn rates
Automatic service discovery

Approval Manager

Enterprise-grade change management for AI-driven actions with full audit trails.

Role-based approval workflows
Risk scoring for every AI action
SOC 2 Type II in progress; controls aligned to AICPA TSC

Nova Shell

AI-powered terminal that translates natural language into infrastructure commands.

Natural language to kubectl, SQL, AWS CLI
Safe mode with dry-run preview
Full command history with rollback

AI Safety and Guardrails

Eight layers of guardrails so your agents never break what they are supposed to fix. Hard caps, dry-runs, post-action verification, and human-in-the-loop on the calls that matter. This is what makes agentic SRE safe to point at production.

Agent Kill Switch

One click halts any agent, any tenant, or the entire platform when something starts misbehaving.

Global, per-tenant, and per-agent stop levels
Halts in under a second across all regions
Bypass policy for sandboxed simulators, locked for prod
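
The three stop levels can be thought of as nested scopes that every agent checks before acting. The following is a minimal sketch under that assumption. The `KillSwitch` class and its method names are hypothetical, and it does not model cross-region propagation or the sandbox bypass policy.

```python
class KillSwitch:
    """Hypothetical three-level stop registry: global, per-tenant, per-agent."""

    def __init__(self):
        self.global_halt = False
        self.halted_tenants = set()
        self.halted_agents = set()  # (tenant, agent) pairs

    def halt(self, tenant=None, agent=None):
        """Halt everything, one tenant, or one agent within a tenant."""
        if agent is not None and tenant is None:
            raise ValueError("an agent-level halt needs its tenant")
        if tenant is None:
            self.global_halt = True
        elif agent is None:
            self.halted_tenants.add(tenant)
        else:
            self.halted_agents.add((tenant, agent))

    def may_run(self, tenant, agent):
        """Checked before every agent action; the broadest halt wins."""
        return not (
            self.global_halt
            or tenant in self.halted_tenants
            or (tenant, agent) in self.halted_agents
        )
```

Calling `halt("acme", "cert-manager")` stops one agent, `halt("acme")` stops the whole tenant, and `halt()` with no arguments stops everything.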

Read the docs →

Prompt Injection Defense

Detects and neutralizes adversarial input before it reaches an LLM, so a hostile log line cannot take over an agent.

Multi-layer input sanitization
Out-of-band egress scanner for leaked secrets
Provider-side guard with confidence scoring

Read the docs →

Cost Circuit Breaker

Hard ceilings on AI spend per tenant and per agent, with auto-halt when a runaway agent burns through budget.

Per-tenant, per-model, per-agent caps
Real-time budget burn-rate alerts
Auto-pause if a model goes over its quota
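
A hard spend ceiling with auto-pause can be sketched as a small budget tracker. This is an illustration, not Nova's billing logic: the scope keys, dollar figures, and the `charge` method are assumptions made up for the example.

```python
class CostCircuitBreaker:
    """Hypothetical per-scope spend caps. A scope is any key such as a tenant, model, or agent."""

    def __init__(self, caps_usd):
        self.caps = dict(caps_usd)
        self.spent = {scope: 0.0 for scope in self.caps}
        self.paused = set()

    def charge(self, scopes, cost_usd):
        """Record a cost against every scope. Return False (and record nothing) if blocked."""
        if any(scope in self.paused for scope in scopes):
            return False
        over = [s for s in scopes if self.spent[s] + cost_usd > self.caps[s]]
        if over:
            # Trip the breaker for every scope this call would push past its cap.
            self.paused.update(over)
            return False
        for scope in scopes:
            self.spent[scope] += cost_usd
        return True


breaker = CostCircuitBreaker({"tenant:acme": 10.0, "agent:acme/pod-lifecycle": 2.0})
scopes = ["tenant:acme", "agent:acme/pod-lifecycle"]
```

In this sketch the call that would breach a cap is rejected before any spend is recorded, and the offending scope stays paused until someone resets it.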

Read the docs →

Error Budget Gate

Risk-gated remediation. If a service's error budget is depleted, we block destructive automations until a human reviews.

SLO-aware action gating
Tied to error-budget burn rates per service
Forces human approval when risk is high
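
The arithmetic behind SLO-aware gating is standard: the error budget is the share of requests an SLO allows to fail, and the burn rate is how fast that budget is being consumed relative to plan. The sketch below applies those definitions. The `gate` function and its `max_burn` threshold of 2.0 are assumptions chosen for illustration, not Nova's policy.

```python
def _check_target(slo_target):
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target, total, bad):
    """Fraction of the error budget left (negative once the budget is overspent)."""
    _check_target(slo_target)
    if total == 0:
        return 1.0
    allowed_bad = (1 - slo_target) * total
    return 1 - bad / allowed_bad


def burn_rate(slo_target, window_total, window_bad):
    """Observed error rate divided by the error rate the SLO allows. 1.0 means on plan."""
    _check_target(slo_target)
    if window_total == 0:
        return 0.0
    return (window_bad / window_total) / (1 - slo_target)


def gate(destructive, budget_remaining, burn, max_burn=2.0):
    """Hypothetical gate: destructive actions need a human when risk is high."""
    if not destructive:
        return "allow"
    if budget_remaining <= 0 or burn >= max_burn:
        return "require_human_approval"
    return "allow"
```

For a 99.9% SLO, 50 failures in 10,000 recent requests is a burn rate of about 5, so a destructive action would be held for review under these assumed thresholds.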

Read the docs →

Tenant Isolation

Cross-tenant access attempts are caught and blocked at the boundary. Customer A cannot see Customer B's data, full stop.

Hard isolation at storage, network, and runtime
Audit trail for every cross-tenant attempt
Default deny on shared resources

Read the docs →

Ground-Truth Verifier

After every remediation, agents re-probe metrics at T+5m, T+1h, and T+24h to confirm the fix held.

Post-action verification windows
Auto-rollback if any probe fails
Feeds the trust score for each agent
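
The verification windows can be sketched as a loop over fixed offsets that rolls back on the first failed probe. This is a simplified illustration: `probe` and `rollback` are hypothetical callables supplied by the caller, and scheduling the probes at the real wall-clock times is left out.

```python
from datetime import timedelta

# The three windows named above: T+5m, T+1h, T+24h.
VERIFY_OFFSETS = (timedelta(minutes=5), timedelta(hours=1), timedelta(hours=24))


def verify_fix(probe, rollback, offsets=VERIFY_OFFSETS):
    """Run probe(offset) for each window; call rollback() once and stop at the first failure.

    probe(offset) is assumed to return True when metrics look healthy at that offset.
    """
    for offset in offsets:
        if not probe(offset):
            rollback()
            return {"held": False, "failed_at": offset}
    return {"held": True, "failed_at": None}
```

The returned result is the kind of signal that could feed an agent's trust score, though how Nova computes that score is not described here.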

Read the docs →

Consensus Arbiter

When multiple agents disagree on root cause or fix, the arbiter resolves the conflict deterministically before any action runs.

Multi-agent debate resolution
Tie-breaker rules per scenario type
Full audit trail of every arbitration
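
Deterministic resolution means the same set of proposals always yields the same decision. One simple way to get that property is confidence-weighted voting with an explicit tie-break order, sketched below. The scoring scheme, the tuple layout, and the `arbitrate` function are assumptions for illustration and may differ from how Nova's arbiter works.

```python
from collections import defaultdict


def arbitrate(proposals, tie_break_order):
    """Pick one fix from (agent, fix, confidence) proposals.

    The fix with the highest summed confidence wins. Ties go to the fix listed
    earliest in tie_break_order, then to alphabetical order, so the result never
    depends on the order in which proposals arrive.
    """
    if not proposals:
        raise ValueError("nothing to arbitrate")
    totals = defaultdict(float)
    for _agent, fix, confidence in proposals:
        totals[fix] += confidence

    def rank(fix):
        if fix in tie_break_order:
            priority = tie_break_order.index(fix)
        else:
            priority = len(tie_break_order)
        return (-totals[fix], priority, fix)

    return min(totals, key=rank)


# Hypothetical disagreement: two agents favor a restart, one favors a rollback.
proposals = [
    ("pod-lifecycle", "restart", 0.5),
    ("deploy-safety", "rollback", 0.75),
    ("network-policy", "restart", 0.25),
]
```

Both fixes total 0.75 here, so the scenario's tie-break order decides the outcome.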

Read the docs →

Simulation Engine

Digital-twin dry-run of any agent action before it touches production. See the predicted impact, then approve or reject.

Counterfactual replay of past incidents
What-if preview for any AI runbook
Zero production blast radius

Read the docs →

Quick Setup

Most teams are fully operational in under 5 minutes. One-line agent install, OAuth-based cloud account connection, and auto-discovery of your services, databases, and existing monitoring tools. Nova imports your existing Datadog dashboards, PagerDuty escalation policies, and Grafana alert rules automatically, so you don't have to rebuild anything. Run Nova alongside your existing stack during the evaluation period and deprecate tools at your own pace once you trust the results.

Install in Under 3 Minutes

One command installs the Nova AI Agent on any Linux or macOS server. The agent monitors CPU, memory, disk, network, processes, containers, and logs automatically.

Single command install on any Linux or macOS server
Automatic systemd service creation and startup
Instant connection to 100 AI agents
Supports Ubuntu, Debian, CentOS, RHEL, Amazon Linux, macOS

View full install guide →
