AI Agent Operations

Your LLM provider is part of your stack, so it gets an SLO too

Provider Health watches every LLM provider you use as if it were one of your own services. It tracks per-provider p50/p95/p99 latency, error rate, and rate-limit headroom. When a provider starts to degrade, Nova routes around it (failing over to a secondary provider or a cached response) before the degradation becomes your incident.

  • Per-provider SLO bands
  • Auto failover on degrade
  • < 30s detection latency
  • Audit-logged every routing decision
Per-Provider SLO

Same machinery as your services

Each provider gets the same SLO treatment as one of your services: a target p95 latency, a target error rate, and a target rate-limit headroom. Burn-rate alerts fire on multi-window thresholds: a 2x burn rate over 6 hours pages, and a 1x burn rate over 24 hours notifies. The provider is held to a number, not a vibe.

  • Three SLOs per provider: p95 latency, error rate, and rate-limit headroom, each with a concrete, configurable target
  • Multi-window burn alerts: same machinery as Service Health Matrix; the page reuses SLO Management primitives
  • Visible on the matrix: providers show up as additional rows on Service Health Matrix so they are not invisible
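The multi-window burn-rate check described above can be sketched as follows. This is a minimal illustration, not Nova's implementation: it assumes burn rate is the observed bad-request fraction divided by the error budget (1 minus the SLO target), and the names `BurnRule`, `burn_rate`, and `evaluate` are hypothetical. Only the 6h × 2x and 24h × 1x thresholds come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    window_hours: int
    threshold: float  # burn-rate multiple that trips the rule
    action: str       # "page" or "notify"


# Thresholds from the text: 6h x 2x pages, 24h x 1x notifies.
RULES = (BurnRule(6, 2.0, "page"), BurnRule(24, 1.0, "notify"))


def burn_rate(bad_fraction: float, slo_target: float) -> float:
    """Observed bad fraction divided by the error budget (1 - target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return bad_fraction / budget


def evaluate(bad_fraction_by_window: dict[int, float], slo_target: float) -> list[str]:
    """Return the actions whose window burn rate meets or exceeds its threshold."""
    actions = []
    for rule in RULES:
        observed = bad_fraction_by_window.get(rule.window_hours)
        if observed is not None and burn_rate(observed, slo_target) >= rule.threshold:
            actions.append(rule.action)
    return actions
```

With a 99% target, a 3% error rate over the last 6 hours is roughly a 3x burn and would page, while 0.5% over 24 hours is roughly 0.5x and would stay quiet.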
Auto-Failover

Routing follows the SLOs

When a provider trips its degraded threshold, Nova routes traffic to a secondary provider for the affected workload class. Routing is gradual (10% increments) and reversible. A provider returning to healthy automatically reclaims its share at the same gradual cadence. No "all-or-nothing" cutovers that introduce their own risk.

  • Gradual shift: 10% per minute; gradual ramp out and gradual ramp back; never a hard cutover
  • Workload-aware: classification tasks (cheap, run on Haiku) move first; expensive tasks (run on Opus) move last
  • Reversible: when health returns, traffic returns at the same cadence, no permanent migrations
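The gradual, reversible ramp can be pictured with a small simulation. This is a sketch under assumptions: one tick stands for one minute, shares are whole percentage points, and the function names are hypothetical. Workload-class ordering is not modeled here.

```python
STEP = 10  # percentage points moved per tick; the text says 10% per minute


def next_secondary_share(current: int, primary_degraded: bool) -> int:
    """Move the secondary provider's traffic share one step toward its goal."""
    if primary_degraded:
        return min(100, current + STEP)
    return max(0, current - STEP)


def simulate(degraded_by_minute: list[bool]) -> list[int]:
    """degraded_by_minute[i] is True when the primary is degraded at minute i."""
    share = 0
    trail = []
    for degraded in degraded_by_minute:
        share = next_secondary_share(share, degraded)
        trail.append(share)
    return trail
```

Three degraded minutes followed by two healthy ones give `simulate([True, True, True, False, False])` a trail of `[10, 20, 30, 20, 10]`: traffic ramps out and back at the same cadence, with no hard cutover in either direction.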
Cost-Aware Routing

Cheaper providers picked when quality is equivalent

Routing also accounts for cost. Two providers with similar p95 and quality on a class? Nova picks the cheaper one. The cost data comes from Cost Circuit Breaker so the routing decisions are aware of your current budget posture (closer to limit = cheaper provider weighted higher).

  • Cost-aware tiebreak: when two providers are equivalent on quality, the cheaper wins
  • Budget-aware: once you are 80% through your budget, cheaper providers get heavier weighting
  • Quality not sacrificed: cost only factors in when quality is statistically equivalent; quality is never traded for cents
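The tiebreak rule can be sketched like this. It is an illustration only: the `quality` score, the `cost_per_1k` field, and the fixed `quality_tolerance` are hypothetical stand-ins for whatever statistical equivalence test the product actually applies, and budget-based weighting is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # hypothetical quality score, higher is better
    cost_per_1k: float  # hypothetical cost per 1k tokens


def pick(candidates: list[Candidate], quality_tolerance: float = 0.01) -> Candidate:
    """Pick the cheapest provider among those within tolerance of the best quality."""
    best_quality = max(c.quality for c in candidates)
    # Cost is only considered among providers treated as equivalent on quality.
    equivalent = [c for c in candidates if best_quality - c.quality <= quality_tolerance]
    return min(equivalent, key=lambda c: c.cost_per_1k)
```

A much cheaper provider with clearly lower quality never enters the `equivalent` list, so it cannot win on price alone.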
Audit

Every routing decision is logged

Routing changes are logged to Agent Ledger like any other agent action. Each entry records the triggering condition (which SLO breached), the gradual ramp steps, the recovery, and the final state. Use the audit to explain "why did our LLM bill spike on April 22?" Usually the answer is a 4-hour failover to a more expensive provider.

  • Triggering SLO: every routing change records which SLO breach caused it
  • Step-by-step trail: every 10% shift is a row, with timestamp and reason
  • Cost attribution: cost records for rerouted traffic are tagged "failover routing" so finance reviews are easy
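One way to picture the trail is a row per ramp step. The shape below is a guess for illustration: the field names, the `ramp_rows` helper, the provider name, and the example date are all hypothetical, and only the "failover routing" tag and the one-row-per-10%-step idea come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RoutingAuditRow:
    timestamp: str            # ISO 8601, UTC
    provider: str             # provider losing traffic (placeholder name)
    triggering_slo: str       # which SLO breach caused the change
    secondary_share_pct: int  # share on the secondary after this step
    reason: str
    cost_tag: str = "failover routing"  # tag quoted in the text


def ramp_rows(provider: str, triggering_slo: str,
              shares: list[int], start: datetime) -> list[RoutingAuditRow]:
    """Emit one row per ramp step, spaced one minute apart."""
    return [
        RoutingAuditRow(
            timestamp=(start + timedelta(minutes=i)).isoformat(),
            provider=provider,
            triggering_slo=triggering_slo,
            secondary_share_pct=share,
            reason=f"{triggering_slo} breached; shifting to {share}%",
        )
        for i, share in enumerate(shares)
    ]


# Arbitrary example inputs, not real data.
rows = ramp_rows("provider-a", "p95_latency", [10, 20, 30],
                 datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Filtering such rows by `cost_tag` and date range is the kind of query that would answer the bill-spike question above.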

When the provider goes weird, you do not

Multi-provider routing is only as good as the signal that drives it. Provider Health is that signal.
