AI Agent Operations

Your LLM provider is part of your stack, so it gets an SLO too

Provider Health watches every LLM provider you use as if it were one of your own services. It tracks per-provider p50/p95/p99 latency, error rate, and rate-limit headroom. When a provider starts to degrade, Nova routes around it (failing over to a secondary provider or a cached response) before the degradation becomes your incident.


app.novaaiops.com / provider-health

● LIVE

Providers · last hour

anthropic: healthy · p95 412ms

rate limit: 62% used

openai: degraded · p95 1.4s

rate limit: 88% used

action: routed 38% of traffic to anthropic

google: healthy · p95 480ms

Per-Provider SLO

Same machinery as your services

Each provider gets the same SLO treatment as one of your services: target p95 latency, target error rate, target rate-limit headroom. Burn-rate alerts fire on multi-window thresholds (6h × 2x = page, 24h × 1x = notify). The provider is held to a number, not a vibe.

✓ Three SLOs per provider: p95 latency, error rate, and rate-limit headroom, each with a concrete, configurable target
✓ Multi-window burn alerts: same machinery as Service Health Matrix; the page reuses SLO Management primitives
✓ Visible on the matrix: providers appear as additional rows on Service Health Matrix, so they are never invisible
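
To make the burn-rate idea concrete, here is a minimal Python sketch of a multi-window check using the two thresholds quoted above (6h × 2x = page, 24h × 1x = notify). It is not Nova's implementation: the names `BurnRule`, `burn_rate`, and `evaluate` are hypothetical, burn rate is assumed to mean the observed bad-request fraction divided by the fraction the SLO allows, and the `>=` comparison is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    window_hours: int
    threshold: float  # burn-rate multiple that triggers the action
    action: str


# Thresholds taken from the text: 6h x 2x = page, 24h x 1x = notify.
RULES = (BurnRule(6, 2.0, "page"), BurnRule(24, 1.0, "notify"))


def burn_rate(bad: int, total: int, budget_fraction: float) -> float:
    """Observed bad-request fraction divided by the fraction the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / budget_fraction


def evaluate(windows: dict[int, tuple[int, int]], budget_fraction: float) -> list[str]:
    """windows maps a window length in hours to (bad_requests, total_requests)."""
    actions = []
    for rule in RULES:
        bad, total = windows.get(rule.window_hours, (0, 0))
        if burn_rate(bad, total, budget_fraction) >= rule.threshold:
            actions.append(rule.action)
    return actions
```

With an error-rate target of 0.5% (`budget_fraction=0.005`), a 6-hour window with 120 bad requests out of 10,000 burns at roughly 2.4x and would page under these assumptions.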

app.novaaiops.com / provider-health · slo

SLOs · anthropic

p95 target: < 800ms

error rate target: < 0.5%

rate-limit headroom: > 20%

30d compliance: 99.4%

Auto-Failover

Routing follows the SLOs

When a provider trips its degraded threshold, Nova routes traffic to a secondary provider for the affected workload class. Routing is gradual (10% increments) and reversible. A provider returning to healthy automatically reclaims its share at the same gradual cadence. No "all-or-nothing" cutovers that introduce their own risk.

✓ Gradual shift: 10% per minute, ramping out and ramping back at the same rate; never a hard cutover
✓ Workload-aware: tasks in the classify class (cheap, served by Haiku) move first; expensive tasks (served by Opus) move last
✓ Reversible: when health returns, traffic returns at the same cadence; no permanent migrations
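
The gradual, reversible ramp can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Nova's routing code: the function names are hypothetical, one tick is assumed to be one minute, and shares are kept as integer percentage points to avoid floating-point drift.

```python
from __future__ import annotations

STEP_PCT = 10  # 10% per tick, matching the cadence described in the text


def next_secondary_share(current_pct: int, primary_degraded: bool) -> int:
    """Move the secondary provider's share one step out or one step back."""
    if primary_degraded:
        return min(100, current_pct + STEP_PCT)
    return max(0, current_pct - STEP_PCT)


def simulate(degraded_by_minute: list[bool]) -> list[int]:
    """Return the secondary provider's share after each one-minute tick."""
    share = 0
    trail = []
    for degraded in degraded_by_minute:
        share = next_secondary_share(share, degraded)
        trail.append(share)
    return trail
```

Three degraded minutes followed by recovery would ramp out to 30% and then ramp back to 0% at the same rate, with no hard cutover in either direction.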

app.novaaiops.com / provider-health · failover

Failover · openai degrade

14:18 openai p95 > 1s for 3m

14:19 shift 10% to anthropic

14:25 shift 38% to anthropic (gradual)

15:02 openai recovered · ramp back begins

Cost-Aware Routing

Cheaper providers picked when quality is equivalent

Routing also accounts for cost. Two providers with similar p95 and quality on a class? Nova picks the cheaper one. The cost data comes from Cost Circuit Breaker so the routing decisions are aware of your current budget posture (closer to limit = cheaper provider weighted higher).

✓ Cost-aware tiebreak: when two providers are equivalent on quality, the cheaper wins
✓ Budget-aware: when you are 80% through your budget, cheaper providers are weighted more heavily
✓ Quality not sacrificed: cost is a factor only when quality is statistically equivalent; Nova never trades quality for cents
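
One way to picture the three rules above is the Python sketch below. Everything in it is an assumption made for illustration: the `Candidate` fields, the fixed `tolerance` standing in for a real statistical-equivalence test, the inverse-cost weighting, and the exponent that steepens once 80% of the budget is used. The page does not specify how Nova computes its weights.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # hypothetical eval score in [0, 1]
    cost_per_1k: float  # hypothetical price per 1k tokens


def routing_weights(
    candidates: list[Candidate], budget_used: float, tolerance: float = 0.02
) -> dict[str, float]:
    """Weight only quality-equivalent candidates, favouring the cheaper ones."""
    best = max(c.quality for c in candidates)
    # Quality gate first: cost never rescues a lower-quality candidate.
    eligible = [c for c in candidates if best - c.quality <= tolerance]
    # Assumed rule: lean harder toward cheap providers at 80%+ of budget.
    exponent = 2.0 if budget_used >= 0.8 else 1.0
    raw = {c.name: (1.0 / c.cost_per_1k) ** exponent for c in eligible}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}
```

In this sketch a much cheaper candidate whose quality falls outside the tolerance receives no traffic at all, which mirrors the "quality not sacrificed" rule.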

app.novaaiops.com / provider-health · cost

Routing weights · classify class

haiku-4-5: 62% · cheap, fast, equivalent

gpt-4o-mini: 28%

gemini-1.5-flash: 10%

est. savings: $420 / week

Audit

Every routing decision is logged

Routing changes are logged to Agent Ledger like any other agent action: the triggering condition (which SLO was breached), the gradual ramp steps, the recovery, and the final state. Use the audit to answer questions such as "why did our LLM bill spike on April 22?" Usually the answer is a 4-hour failover to a more expensive provider.

✓ Triggering SLO: every routing change records which SLO breach caused it
✓ Step-by-step trail: every 10% shift is a row, with timestamp and reason
✓ Cost attribution: billing rows are tagged "failover routing" so finance reviews are easy
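
As a rough picture of what one audit row might carry, here is a Python sketch. The schema is hypothetical: the page does not document Agent Ledger's actual record format, so the `RoutingAuditRow` fields and the JSON-lines serialization are assumptions. Only the "failover routing" tag and the idea of one row per shift come from the text.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

FAILOVER_TAG = "failover routing"  # cost-attribution tag named in the text


@dataclass(frozen=True)
class RoutingAuditRow:
    timestamp: str        # ISO 8601 string supplied by the caller
    from_provider: str
    to_provider: str
    shifted_pct: int      # cumulative share on the secondary after this step
    triggering_slo: str   # which SLO breach caused the change
    reason: str
    cost_tag: str = FAILOVER_TAG


def to_ledger_line(row: RoutingAuditRow) -> str:
    """Serialize one row as a JSON line (assumed storage format)."""
    return json.dumps(asdict(row), sort_keys=True)


def rows_for_slo(rows: list[RoutingAuditRow], slo: str) -> list[RoutingAuditRow]:
    """Answer 'which routing changes did this SLO breach cause?'"""
    return [row for row in rows if row.triggering_slo == slo]
```

Filtering on `cost_tag` in the same way would give finance the subset of rows attributable to failover.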

app.novaaiops.com / provider-health · audit

Audit · last 30d

failovers triggered: 4

recoveries: 4 (all)

avg dwell: 2h 14m

cost impact: +$180 (failover surcharge)


When the provider goes weird, you do not

Multi-provider routing is only as good as the signal that drives it. Provider Health is that signal.

