Reliability Engineering

Define what reliable means,
and let Nova hold every service to it

SLO Management is where you write down what reliable looks like for each service: SLI definitions, SLO targets, burn-rate alert rules, and error budget policies. Once defined, every service in Service Health Matrix is tracked against your numbers, and the agents respect them when deciding whether to act.


app.novaaiops.com / slo-management


SLO · payments × p95 latency

# SLI
name: payments_p95
type: latency_quantile
query: p95(http.duration_ms{service="payments"})
# SLO
target: "< 200ms"
window: 30d rolling
burn_alert:
  fast: 6h × 2x → page
  slow: 24h × 1x → notify

SLI Definition

Pick a signal, write a query

An SLI (Service Level Indicator) is the measurable thing. Latency, availability, error rate, freshness, anything you can express as a query against your signals. Nova ships templates for the common ones and a free-form mode for custom indicators. Every SLI is testable in a sandbox before you commit it.

✓ Five built-in SLI types: latency_quantile, availability_ratio, error_ratio, saturation, and data_freshness, all working on day one
✓ Custom SLIs in NovaQL: write any query that returns a number per minute and Nova will track it as an SLI
✓ Sandbox runs before commit: see what the SLI would have looked like over the past 30 days before you set the target

app.novaaiops.com / slo-management · sli

SLI sandbox · payments_p95

LAST 30 DAYS

p95 (30d): 218ms
min: 142ms
max: 412ms
compliance vs 200ms: 94.2%
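
As a rough illustration of the kind of numbers a sandbox run reports, the sketch below summarizes a per-minute latency series against a candidate threshold. It is not Nova's implementation: the function names, the nearest-rank percentile method, the sample data, and the definition of compliance as "share of minutes under the threshold" are all assumptions made for the example.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def sandbox_summary(per_minute_ms, threshold_ms):
    """Summarize a per-minute latency series against a candidate target.

    Assumption: compliance is the share of minutes strictly under the threshold.
    """
    good = sum(1 for v in per_minute_ms if v < threshold_ms)
    return {
        "p95": percentile(per_minute_ms, 95),
        "min": min(per_minute_ms),
        "max": max(per_minute_ms),
        "compliance_pct": round(100 * good / len(per_minute_ms), 1),
    }

# Hypothetical sample: ten one-minute latency readings in milliseconds.
sample = [150, 160, 170, 180, 190, 195, 198, 205, 230, 410]
summary = sandbox_summary(sample, threshold_ms=200)
print(summary)
```

Running a summary like this before committing shows whether a target such as 200ms is realistic for the service's recent history.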

SLO Targets & Windows

Pick the number, pick the window

A target ("p95 under 200ms") plus a window ("30 days, rolling") makes an SLO. Nova supports rolling windows (last 30 days) and calendar windows (this month, this quarter). Calendar windows reset; rolling windows do not. Pick the one that matches your reporting cadence.

✓ Rolling vs calendar: rolling is always the last N days; calendar is this month, quarter, or year. Pick per SLO
✓ Composite SLOs: an SLO can require multiple SLIs ("p95 under 200ms AND error rate under 0.1%") for stricter governance
✓ Per-tier defaults: tier-0 services start at 99.9%, tier-1 at 99.5%, tier-2 at 99%. Override per service as needed
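
The difference between the two window types comes down to where the window starts. The sketch below shows one way to compute that boundary; the function names are hypothetical and the calendar case covers only monthly windows.

```python
from datetime import datetime, timedelta, timezone

def rolling_window_start(now, days):
    """A rolling window always covers the last `days` days up to `now`."""
    return now - timedelta(days=days)

def calendar_month_start(now):
    """A calendar-month window resets at 00:00 on the first of the month."""
    return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# Hypothetical evaluation time.
now = datetime(2025, 11, 4, 12, 0, tzinfo=timezone.utc)
rolling_start = rolling_window_start(now, 30)
calendar_start = calendar_month_start(now)
print(rolling_start)   # 2025-10-05 12:00:00+00:00
print(calendar_start)  # 2025-11-01 00:00:00+00:00
```

On 4 November the calendar window holds only three and a half days of data, while the rolling window still holds a full 30 days. That is why a calendar window suits monthly reporting and a rolling window suits continuous alerting.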

app.novaaiops.com / slo-management · target

SLO · checkout-success

target: 99.5%
window: 30d rolling
tier: tier-1
composite: 2 SLIs (success ratio, p95)
owner: payments-team
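
To make the targets concrete, the sketch below converts a percentage target into an error budget and shows the AND logic of a composite SLO. It assumes a time-based budget (minutes of non-compliance allowed per window), which may differ from how Nova accounts for budget; the function names and SLI keys are hypothetical.

```python
def error_budget_minutes(target_pct, window_days):
    """Minutes of allowed non-compliance for a target over a window.

    Assumption: the budget is time-based rather than request-based.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_pct) / 100

def composite_met(sli_results):
    """A composite SLO is met only if every component SLI meets its target."""
    return all(sli_results.values())

# Per-tier default targets over a 30-day window.
for tier, target in [("tier-0", 99.9), ("tier-1", 99.5), ("tier-2", 99.0)]:
    print(tier, round(error_budget_minutes(target, 30), 1))

# Hypothetical composite: success ratio is fine, p95 latency is not.
print(composite_met({"success_ratio": True, "p95": False}))
```

Under these assumptions a 99.5% target over 30 days leaves 216 minutes of budget, against 43.2 minutes at 99.9% and 432 minutes at 99%.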

Burn-Rate Alerts

Two windows, two alerts, no flapping

Nova uses the multi-window, multi-burn-rate pattern from the Google SRE Workbook. A short window with a high burn rate (6h × 2x) pages on-call when something is acutely wrong. A long window with a lower burn rate (24h × 1x) notifies the team when the budget is slowly draining. Two alerts, and far fewer false alarms.

✓ Fast-burn alert: 6h window × 2x burn rate threshold → page on-call (acute)
✓ Slow-burn alert: 24h window × 1x burn rate threshold → notify team channel (drift)
✓ Tunable per SLO: override windows and ratios for SLOs where the defaults do not fit your traffic shape

app.novaaiops.com / slo-management · alerts

Alert config · payments_p95

fast: 6h × 2x → page on-call
slow: 24h × 1x → slack notify
quiet hours: none (page anytime)
auto-correlate: on (Nova Rewind)
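
Burn rate is the observed bad fraction divided by the fraction the SLO allows, so 1x spends the budget exactly over the SLO window and 2x spends it twice as fast. The sketch below evaluates the two default rules under that definition. It is illustrative only: the rule structure, the function names, the sample figures, and the "meets or exceeds" comparison are assumptions, not Nova's alert engine.

```python
def burn_rate(bad_fraction, target_pct):
    """How many times faster than sustainable the budget is being spent."""
    allowed = (100 - target_pct) / 100
    return bad_fraction / allowed

def budget_consumed_pct(rate, alert_window_hours, slo_window_days):
    """Share of the whole budget spent if `rate` holds for the alert window."""
    return 100 * rate * alert_window_hours / (slo_window_days * 24)

# The two default rules described above, in a hypothetical structure.
RULES = [
    {"name": "fast", "window_hours": 6, "threshold": 2.0, "action": "page"},
    {"name": "slow", "window_hours": 24, "threshold": 1.0, "action": "notify"},
]

def evaluate(bad_fraction_by_window, target_pct):
    """Return the actions of rules whose burn rate meets or exceeds the threshold."""
    fired = []
    for rule in RULES:
        rate = burn_rate(bad_fraction_by_window[rule["window_hours"]], target_pct)
        if rate >= rule["threshold"]:
            fired.append(rule["action"])
    return fired

# Hypothetical: 99.5% target, 1.5% bad over the last 6h, 0.4% bad over the last 24h.
fired = evaluate({6: 0.015, 24: 0.004}, target_pct=99.5)
print(fired)
```

With a 30-day SLO window, a sustained 2x burn over 6 hours spends about 1.67% of the budget and a sustained 1x burn over 24 hours spends about 3.33%, which gives a sense of how much budget is gone by the time each alert fires.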

Versioning & Review

SLOs are code, reviewed and versioned

Every SLO change creates a new version with a diff, an author, a reason, and an optional reviewer. Tighten a target by accident? Roll it back to the prior version with one click. Export the whole library to YAML for IaC. Import from YAML for GitOps. SLOs that survive review are SLOs that survive an audit.

✓ Diff + author + reason: every change shows what was changed, by whom, when, and why, visible from the SLO detail page
✓ YAML export and import: GitOps-friendly. Store SLOs as YAML in your repo and sync them to Nova on merge
✓ Reviewer gate (optional): require a reviewer for tier-0 SLO changes so that no one can loosen a critical target alone

app.novaaiops.com / slo-management · history

History · payments_p95

v12: tightened target 220ms → 200ms · marc
v11: added composite SLI · sarah
v10: changed window 7d → 30d rolling · marc
v9: initial · marc · 2025-11-04
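
A version history like the one above can be modeled as immutable records with a field-level diff between neighbors. The sketch below is a minimal illustration, not Nova's data model: the SLOVersion class, its fields, and the diff helper are hypothetical, and the values are taken from the v11 and v12 entries in the example history.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SLOVersion:
    """Hypothetical immutable record of one SLO version."""
    version: int
    target: str
    window: str
    author: str
    reason: str

def diff(old, new):
    """Return {field: (old, new)} for SLO fields that changed between versions."""
    skip = {"version", "author", "reason"}  # change metadata, not SLO content
    a, b = asdict(old), asdict(new)
    return {k: (a[k], b[k]) for k in a if k not in skip and a[k] != b[k]}

v11 = SLOVersion(11, "< 220ms", "30d rolling", "sarah", "added composite SLI")
v12 = SLOVersion(12, "< 200ms", "30d rolling", "marc", "tightened target")
changes = diff(v11, v12)
print(changes)  # {'target': ('< 220ms', '< 200ms')}
```

Keeping each version immutable means a rollback can be recorded as a new version carrying the earlier values, so the history itself is never rewritten.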


Reliability you can govern, not just dashboard

SLOs in Nova are first-class objects: versioned, reviewable, and enforced by the agents and the alert pipeline.

