Provider Health watches every LLM provider you use as if it were one of your own services. It tracks per-provider p50/p95/p99 latency, error rate, and rate-limit headroom. When a provider starts to degrade, Nova routes around it (failing over to a secondary provider or a cached response) before the degradation becomes your incident.
Each provider gets the same SLO treatment as one of your services: a target p95 latency, a target error rate, and a target rate-limit headroom. Burn-rate alerts fire on multi-window thresholds: a 2x burn rate over 6 hours pages, and a 1x burn rate over 24 hours notifies. The provider is held to a number, not a vibe.
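A minimal sketch of how a multi-window burn-rate check could work. It assumes burn rate means the observed error rate divided by the target error rate, and that a rule fires when the burn rate meets or exceeds its threshold. The names `BurnRule`, `burn_rate`, and `evaluate` are illustrative and not part of Nova's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    window_hours: int   # lookback window for the rule
    threshold: float    # burn-rate multiple that triggers the rule
    action: str         # what to do when the rule fires


# Thresholds taken from the text: 6h at 2x pages, 24h at 1x notifies.
RULES = (
    BurnRule(window_hours=6, threshold=2.0, action="page"),
    BurnRule(window_hours=24, threshold=1.0, action="notify"),
)


def burn_rate(errors: int, requests: int, target_error_rate: float) -> float:
    """Observed error rate as a multiple of the target (assumed definition)."""
    if requests == 0:
        return 0.0
    return (errors / requests) / target_error_rate


def evaluate(window_counts: dict, target_error_rate: float) -> list:
    """window_counts maps window_hours -> (errors, requests) for one provider."""
    actions = []
    for rule in RULES:
        errors, requests = window_counts[rule.window_hours]
        if burn_rate(errors, requests, target_error_rate) >= rule.threshold:
            actions.append(rule.action)
    return actions


# Hypothetical counts: a sharp 6h spike against a 1% error-rate target.
spike = {6: (30, 1000), 24: (40, 5000)}
# Hypothetical counts: a slow 24h burn with a quiet 6h window.
slow_burn = {6: (10, 1000), 24: (60, 5000)}
```

With these made-up counts, `evaluate(spike, 0.01)` returns only the page action and `evaluate(slow_burn, 0.01)` returns only the notify action, which is the point of pairing a short window with a long one.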
When a provider trips its degraded threshold, Nova routes traffic to a secondary provider for the affected workload class. Routing is gradual (10% increments) and reversible. A provider returning to healthy automatically reclaims its share at the same gradual cadence. No "all-or-nothing" cutovers that introduce their own risk.
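The gradual, reversible ramp described above can be sketched as a small state update. This is an illustration under assumptions: the share is tracked as whole percentage points, one step is applied per evaluation cycle, and the function names are hypothetical.

```python
STEP = 10  # percentage points moved per ramp step, per the text


def next_secondary_share(current: int, primary_degraded: bool) -> int:
    """Move the secondary provider's traffic share one step up or down."""
    if primary_degraded:
        return min(100, current + STEP)
    return max(0, current - STEP)


def simulate(health_signals: list) -> list:
    """Replay a sequence of degraded/healthy readings, starting from 0% secondary."""
    share = 0
    history = []
    for degraded in health_signals:
        share = next_secondary_share(share, degraded)
        history.append(share)
    return history


# Three degraded cycles followed by four healthy ones.
ramp = simulate([True, True, True, False, False, False, False])
# ramp == [10, 20, 30, 20, 10, 0, 0]
```

The same step size is used in both directions, so recovery mirrors failover and the primary never reclaims all of its traffic in a single jump.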
Routing also accounts for cost. Two providers with similar p95 and quality on a class? Nova picks the cheaper one. The cost data comes from Cost Circuit Breaker, so routing decisions reflect your current budget posture: the closer you are to your limit, the more heavily the cheaper provider is weighted.
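One way to picture cost-aware selection is as a two-stage choice: first keep the providers whose p95 is close to the best, then take the cheapest of those. The sketch below is an assumption about how that might look, not Nova's actual scoring. It leaves out quality, treats "similar" as a relative p95 tolerance, and widens that tolerance as budget usage rises. `Candidate`, `pick_provider`, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    p95_ms: float
    cost_per_1k_tokens: float


def pick_provider(candidates, budget_used_fraction, base_tolerance=0.10):
    """Pick the cheapest provider among those with a 'similar' p95.

    Assumption: the closer spend is to the budget limit, the wider the
    latency tolerance, so cheaper providers qualify more easily.
    """
    tolerance = base_tolerance * (1 + budget_used_fraction)
    best_p95 = min(c.p95_ms for c in candidates)
    similar = [c for c in candidates if c.p95_ms <= best_p95 * (1 + tolerance)]
    return min(similar, key=lambda c: c.cost_per_1k_tokens)


# Made-up providers and prices, for illustration only.
providers = [
    Candidate("alpha", p95_ms=800, cost_per_1k_tokens=0.030),
    Candidate("beta", p95_ms=850, cost_per_1k_tokens=0.020),
    Candidate("gamma", p95_ms=940, cost_per_1k_tokens=0.010),
]

relaxed = pick_provider(providers, budget_used_fraction=0.0)
tight = pick_provider(providers, budget_used_fraction=0.9)
```

With plenty of budget left, only `alpha` and `beta` count as similar and the cheaper `beta` wins. At 90% of budget the tolerance widens enough to admit `gamma`, which is cheaper still.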
Routing changes are logged to Agent Ledger like any other agent action. Each entry records the triggering condition (which SLO breached), the gradual ramp steps, the recovery, and the final state. Use the audit trail to answer questions like "why did our LLM bill spike on April 22?" Usually the answer is a 4-hour failover to a more expensive provider.
Multi-provider routing is only as good as the signal that drives it. Provider Health is that signal.