Reliability Engineering

Every service, every SLO, one grid,
green/yellow/red without the noise

Service Health Matrix is the reliability roll-up. One row per service, one column per SLO, color-coded compliance. Click any cell to drill into the underlying SLI, the burn-rate window, and the active alerts. Designed for the morning standup and the 3am page alike.

app.novaaiops.com / service-health-matrix
  • 96+ services tracked
  • 5 default SLOs per service
  • < 60s cell refresh interval
  • 30d burn-rate window
SLO Compliance per Cell

A cell is a service crossed with an SLO

Each row is a service, and each column is one of the SLOs defined for it (latency, availability, error rate, saturation, freshness). The cell color is computed by comparing the live SLI against the SLO target over the rolling 30-day window: green is on track, yellow is burning budget faster than the target rate, and red is over budget.

  • Five default SLOs: p95 latency, availability, error rate, saturation, and freshness, pre-configured per service tier
  • Custom SLOs: add your own (cart-checkout success, payment-success, transcoding-success) and each one gets its own column
  • Three-state coloring: green (on track), yellow (burning faster than target), red (over budget, page on-call)
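
As a rough illustration of the three-state rule, the sketch below colors one cell from event counts. It is not Nova's actual implementation: the `CellInput` fields, the 1h "recent" window, and the exact thresholds are assumptions chosen to match the green/yellow/red descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellInput:
    """Event counts for one service/SLO pair (all field names are hypothetical)."""
    target: float       # SLO target, e.g. 0.999 means 99.9% of events must be good
    good_30d: int       # good events in the rolling 30-day window
    total_30d: int      # all events in the rolling 30-day window
    good_recent: int    # good events in a recent short window (assumed: 1h)
    total_recent: int   # all events in that short window


def error_ratio(good: int, total: int) -> float:
    """Fraction of bad events; an empty window counts as no errors."""
    return 0.0 if total == 0 else 1.0 - good / total


def cell_color(cell: CellInput) -> str:
    """Return "green", "yellow" or "red" for one cell."""
    if not 0.0 < cell.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    budget = 1.0 - cell.target  # error ratio the SLO allows
    # Red: errors over the 30-day window exceed what the SLO allows.
    if error_ratio(cell.good_30d, cell.total_30d) > budget:
        return "red"
    # Yellow: still within budget, but the recent window burns faster than 1x.
    if error_ratio(cell.good_recent, cell.total_recent) > budget:
        return "yellow"
    return "green"


on_track = CellInput(target=0.999, good_30d=999_500, total_30d=1_000_000,
                     good_recent=9_995, total_recent=10_000)
burning = CellInput(target=0.999, good_30d=999_500, total_30d=1_000_000,
                    good_recent=9_970, total_recent=10_000)
over_budget = CellInput(target=0.999, good_30d=998_000, total_30d=1_000_000,
                        good_recent=9_995, total_recent=10_000)
print(cell_color(on_track), cell_color(burning), cell_color(over_budget))
```

The counts-based form suits availability and error-rate SLOs; latency, saturation and freshness would need their own "good event" definition before the same rule applies.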
Burn-Rate Windows

Multi-window burn rate, not single-point

Each cell tracks burn rate across short and long windows (1h, 6h, 24h, 30d), so a brief blip does not flip the cell red and a slow drift does not stay green forever. The cell shows the burn-rate ratio (how many times faster than the budgeted rate the service is consuming its error budget) and the projected exhaustion date.

  • Four windows per cell: short (1h, 6h) catch acute regressions, long (24h, 30d) catch slow drifts
  • Burn ratio shown numerically: "1x" spends the budget exactly over the 30-day window, "5x" burns the whole 30-day budget in 6 days
  • Projected exhaustion: estimated date the budget runs out at the current rate, used to time fix-or-cut decisions
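
The arithmetic behind the burn ratio and the exhaustion date can be sketched as follows. This uses the common definition of burn rate (observed error ratio divided by the allowed error ratio) and assumes the burn stays constant; the function names and the `budget_remaining` parameter are illustrative, not part of the product.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SLO_WINDOW_DAYS = 30                  # rolling SLO window from the text
WINDOWS = ("1h", "6h", "24h", "30d")  # the four burn-rate windows per cell


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return error_ratio / (1.0 - target)


def burn_by_window(error_ratios: dict, target: float) -> dict:
    """Burn rate for each window, given the observed error ratio per window."""
    return {w: burn_rate(error_ratios[w], target) for w in WINDOWS}


def days_to_exhaustion(rate: float, budget_remaining: float = 1.0) -> Optional[float]:
    """Days until the budget runs out at a constant rate; None if nothing is burning."""
    if rate <= 0:
        return None
    return SLO_WINDOW_DAYS * budget_remaining / rate


def projected_exhaustion(now: datetime, rate: float,
                         budget_remaining: float = 1.0) -> Optional[datetime]:
    """Estimated date the budget runs out at the current rate."""
    days = days_to_exhaustion(rate, budget_remaining)
    return None if days is None else now + timedelta(days=days)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(days_to_exhaustion(5.0))         # 6.0, the "5x" case from the list above
print(projected_exhaustion(now, 5.0))  # six days after `now`
```

With half the budget already spent (`budget_remaining=0.5`), the same 5x burn leaves 3 days, which is why the projection matters more than the ratio alone when timing a fix-or-cut decision.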
Filter & Roll-Up

See the matrix at the right altitude

Filter by team, by tier (tier-0 / tier-1 / tier-2), by environment, by region. Roll up by team to get a one-row view per team. Drill down into any cell to see the underlying SLI definition, the queries that compute it, and the alerts that fire on burn-rate breaches.

  • Saved views per role: an exec view (one row per team), an SRE view (every service), an on-call view (only red and yellow)
  • Role-based defaults: platform-admin sees everything; team lead sees their team; SRE on-call sees the alerting cells
  • Persistent filter URLs: every filter combination is a URL you can paste into Slack so the team sees the same matrix you do
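
One way to picture filtering and the team roll-up is the sketch below. The cell dictionaries and their keys are hypothetical, and the "worst color wins" aggregation is an assumption about how a one-row-per-team view could be derived, not a description of Nova's internals.

```python
from collections import defaultdict

# Higher number means worse state; used to pick the worst color in a group.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def filter_cells(cells, **criteria):
    """Keep cells whose fields match every given criterion (team, tier, ...)."""
    return [c for c in cells if all(c.get(k) == v for k, v in criteria.items())]


def roll_up_by_team(cells):
    """Collapse services into one row per team: worst color per SLO column."""
    rows = defaultdict(dict)
    for c in cells:
        current = rows[c["team"]].get(c["slo"], "green")
        rows[c["team"]][c["slo"]] = max(current, c["color"], key=SEVERITY.__getitem__)
    return dict(rows)


def on_call_view(cells):
    """Only the cells that need attention: yellow and red."""
    return [c for c in cells if c["color"] != "green"]


cells = [
    {"service": "checkout", "team": "payments", "tier": "tier-0",
     "slo": "availability", "color": "green"},
    {"service": "ledger", "team": "payments", "tier": "tier-0",
     "slo": "availability", "color": "red"},
    {"service": "search", "team": "discovery", "tier": "tier-1",
     "slo": "availability", "color": "yellow"},
    {"service": "search", "team": "discovery", "tier": "tier-1",
     "slo": "p95 latency", "color": "green"},
]
print(roll_up_by_team(cells))
print([c["service"] for c in on_call_view(cells)])
```

Taking the worst color keeps a single red service visible in the exec view instead of averaging it away.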
Alerts & On-Call Hooks

Cells turn red and someone gets paged

When a cell crosses the red threshold (over budget or a fast-burn breach), Nova fires an alert into the On-Call rotation for the owning team. The alert payload includes the service, the SLO, the burn rate, the projected exhaustion, and a deep link back to the cell. No "where do I go?" guessing.

  • Owner-routed alerts: every service has an owner team; the alert routes to that team's rotation, not a generic firehose
  • Deep-link payload: the alert text contains a URL to the exact cell that breached, with the right time window pre-loaded
  • Auto-correlate with deploys: breached cells are automatically cross-referenced with Nova Rewind, so related deploys show up in the same alert
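
To make the payload concrete, here is a minimal sketch of an alert carrying the fields listed above and routed by owner team. The field names, the JSON shape, the query parameters in the deep link, and the `rotations` mapping are all hypothetical; the real payload format is not documented on this page.

```python
import json
from dataclasses import asdict, dataclass
from urllib.parse import urlencode

# Base path taken from the page; the query parameters below are assumed.
MATRIX_URL = "https://app.novaaiops.com/service-health-matrix"


@dataclass(frozen=True)
class BreachAlert:
    service: str
    slo: str
    owner_team: str
    burn_rate: float
    projected_exhaustion: str  # ISO 8601 timestamp
    window: str                # burn-rate window that breached, e.g. "1h"

    def deep_link(self) -> str:
        """URL to the breached cell with the time window pre-loaded."""
        query = urlencode({"service": self.service, "slo": self.slo,
                           "window": self.window})
        return f"{MATRIX_URL}?{query}"

    def to_json(self) -> str:
        """Serialize the alert, including the deep link, for delivery."""
        payload = asdict(self)
        payload["deep_link"] = self.deep_link()
        return json.dumps(payload, sort_keys=True)


def route(alert: BreachAlert, rotations: dict) -> str:
    """Pick the on-call rotation of the owning team (no generic fallback)."""
    return rotations[alert.owner_team]


alert = BreachAlert(service="checkout", slo="availability", owner_team="payments",
                    burn_rate=14.4, projected_exhaustion="2024-01-03T02:00:00+00:00",
                    window="1h")
rotations = {"payments": "payments-oncall", "discovery": "discovery-oncall"}
print(route(alert, rotations))
print(alert.to_json())
```

Raising a `KeyError` for a service without an owner, rather than falling back to a shared channel, mirrors the "owner-routed, not a generic firehose" idea.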
A reliability dashboard you actually look at

Most reliability tools show a single org-wide score. Service Health Matrix shows the truth: which service, which SLO, by how much.
