SLO and Graceful Degradation
Graceful degradation preserves the SLO.
Modes
Graceful degradation is the difference between a service that fails partially and one that fails totally. When upstream dependencies are struggling or the service itself is under pressure, the right design serves a degraded but useful response instead of an error. The first piece of the design is naming the modes explicitly so the team can reason about which one is currently active.
The standard mode hierarchy:
- Full mode: Everything works. All features available, all writes accepted, all reads use fresh data. This is the normal operating state, and the SLO target is calibrated against this mode.
- Read-only mode: Writes are rejected (with a clear error message and a retry-after hint); reads still work. This protects the database or downstream service from write pressure during incidents. Most user requests are reads, so a read-only mode keeps most of the user experience intact.
- Cached-only mode: Reads come from cache, possibly stale, with a header indicating freshness. The backend is bypassed entirely. The customer sees data that may be a few minutes out of date, but the service responds.
- Static mode: A pre-rendered or constant response is returned for every request. No backend, no cache, just a static page or a known-good payload. Used when even the cache is unavailable.
- Down: The service returns a 503 with a clear retry policy and a link to the status page. The last resort, used only when no degradation mode can serve a useful response.
The point of explicit modes is that the team can decide ahead of time which mode to use under which conditions. Without explicit modes, every incident becomes an improvisation, and improvisations under stress produce worse outcomes than rehearsed degradation.
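To make the hierarchy concrete, here is a minimal sketch of an explicit mode enum and a mode-aware request path. The names (ServiceMode, handle_request, Response) and the req/db/cache interfaces are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class ServiceMode(IntEnum):
    """Explicit degradation modes, ordered from least to most degraded."""
    FULL = 0
    READ_ONLY = 1
    CACHED_ONLY = 2
    STATIC = 3
    DOWN = 4


@dataclass
class Response:
    status: int
    body: object = None
    headers: dict = field(default_factory=dict)


def handle_request(req, mode, db, cache, static_payload):
    """Route one request according to the active mode.

    Assumed interfaces: req has .is_write, .key, .payload; db has .read/.write;
    cache.get returns (value, age_seconds).
    """
    if mode is ServiceMode.DOWN:
        # Last resort: clear retry policy plus a pointer to the status page.
        return Response(503, {"status_page": "https://status.example.com"},
                        {"Retry-After": "120"})

    headers = {"X-Service-Mode": mode.name.lower().replace("_", "-")}

    if mode is ServiceMode.STATIC:
        # Pre-rendered, known-good payload; no backend, no cache.
        return Response(200, static_payload, headers)

    if req.is_write:
        if mode is not ServiceMode.FULL:
            # Every degraded mode rejects writes with a retry hint.
            headers["Retry-After"] = "60"
            return Response(503, {"error": "writes temporarily disabled"}, headers)
        return Response(200, db.write(req.payload), headers)

    if mode is ServiceMode.CACHED_ONLY:
        # Possibly stale data, with its age made explicit to the caller.
        value, age_seconds = cache.get(req.key)
        headers["X-Data-Age-Seconds"] = str(age_seconds)
        return Response(200, value, headers)

    # FULL and READ_ONLY both serve reads from the primary store.
    return Response(200, db.read(req.key), headers)
```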
Trigger
The trigger is what flips the service between modes. The discipline is to make this automatic and SLO-aware, not a manual decision in the middle of an incident.
- SLO-aware: at risk equals degrade: When the burn rate threatens to consume the SLO budget faster than the budget can replenish, the service automatically downgrades to a degraded mode. The trigger is the burn rate, not a manual call. The service decides for itself based on a measurable signal.
- Layered triggers: Different triggers for different modes. Database connection pool over 90% utilization triggers read-only. Database fully unavailable triggers cached-only. Cache unavailable triggers static. Each trigger is independent, and the mode that activates is the most-degraded one whose trigger fires (see the sketch after this list).
- Hysteresis on the trigger: Add a delay before flipping back to a less-degraded mode. The pattern is "the trigger to degrade fires immediately; the trigger to recover requires the condition to be clear for at least 5 minutes." This prevents flapping between modes during a recovering incident.
- Auto, but with override: The on-call can manually flip the service into a more-degraded mode if they see something the automated trigger does not. The on-call cannot easily override the system into a less-degraded mode if conditions are still bad; the safe default is on the system's side, not the human's.
- Customer-visible mode indicator: When the service is in a degraded mode, the response includes a header (X-Service-Mode: read-only) and the UI shows a banner. This makes the degradation explicit to the user instead of mysterious; they know to retry later instead of assuming the system is broken.
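A sketch of how the layered, SLO-aware triggers might be expressed, assuming the signals (slo_burn_rate, db_pool_utilization, availability flags) are already collected from metrics; the thresholds and names are placeholders for illustration, not recommendations.

```python
from enum import IntEnum


class ServiceMode(IntEnum):
    FULL = 0
    READ_ONLY = 1
    CACHED_ONLY = 2
    STATIC = 3
    DOWN = 4


# Layered triggers: each maps a measurable condition to the mode it forces.
# slo_burn_rate = observed error rate / error budget rate; above 1.0 the budget
# is being consumed faster than it replenishes.
TRIGGERS = [
    (ServiceMode.READ_ONLY,   lambda s: s["slo_burn_rate"] > 2.0
                                        or s["db_pool_utilization"] > 0.90),
    (ServiceMode.CACHED_ONLY, lambda s: not s["db_available"]),
    (ServiceMode.STATIC,      lambda s: not s["db_available"]
                                        and not s["cache_available"]),
]


def target_mode(signals, manual_override=ServiceMode.FULL):
    """Return the mode the service should be in right now.

    The automated target is the most-degraded mode whose trigger fires. A manual
    override can push the service further down, never back up toward FULL.
    """
    fired = [mode for mode, condition in TRIGGERS if condition(signals)]
    automated = max(fired) if fired else ServiceMode.FULL
    return max(automated, manual_override)


# Example: the burn rate alone is enough to force read-only.
signals = {"slo_burn_rate": 3.4, "db_pool_utilization": 0.72,
           "db_available": True, "cache_available": True}
assert target_mode(signals) is ServiceMode.READ_ONLY
```

This handles only the degrade direction; stepping back toward full mode goes through the hysteresis and step-wise logic sketched in the Recover section below.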
SLO-aware triggers turn graceful degradation from a hopeful design into an active part of the operational practice. The service degrades early enough to preserve the SLO instead of degrading after the SLO has already been blown.
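Making the active mode visible to callers is cheap to wire in. A minimal sketch as WSGI middleware, assuming a get_current_mode hook into whatever holds the current mode; the UI reads the same header and shows the degraded-service banner whenever the value is anything other than full.

```python
class ModeHeaderMiddleware:
    """WSGI middleware that stamps the active degradation mode on every response.

    get_current_mode is an assumed hook returning a string such as "full",
    "read-only", or "cached-only".
    """

    def __init__(self, app, get_current_mode):
        self.app = app
        self.get_current_mode = get_current_mode

    def __call__(self, environ, start_response):
        def start_with_mode(status, headers, exc_info=None):
            headers.append(("X-Service-Mode", self.get_current_mode()))
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_mode)
```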
Recover
The recovery direction is just as important as the degradation direction. Coming back to full mode too aggressively re-exposes the service to the same conditions that triggered degradation; coming back too slowly leaves the user experience worse than necessary.
- Auto-recovery when SLO is safe: The service watches the same SLO and burn-rate signals on the way back. When the burn rate has stayed clean for the hysteresis window, the service auto-promotes to a less-degraded mode. The recovery is automatic, just like the degradation (see the sketch after this list).
- Both directions instrumented: Every mode change (degrade and recover) emits an event. Time spent in each mode is tracked. The retro after an incident has a complete record of how the service moved through modes and how long it stayed in each.
- Step-wise recovery, not jumps: Recover one mode at a time, with the hysteresis window between each step. Going from cached-only to full in one jump skips the read-only layer that would have caught residual write pressure. Step-wise recovery is safer and not noticeably slower.
- Communicate the mode change: Status page updates, deploy channel posts, and customer-visible banners all reflect the recovery as it happens. Customers who were waiting for "back to normal" want to know when it actually happens.
- Don't recover during ongoing incident response: If the on-call is actively investigating, hold the auto-recovery. The automated signal is good enough to tell the system to step back into a degraded mode; the on-call's judgment is what tells the system "the underlying problem is actually fixed, you can recover for real now."
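To close the loop, a sketch of a controller that applies those rules, assuming the trigger evaluation above feeds it a target mode on every tick. The emit_event and incident_in_progress hooks are assumptions standing in for the team's event bus and incident tracker; ModeController and the five-minute hold are illustrative, not prescriptive.

```python
import time
from enum import IntEnum


class ServiceMode(IntEnum):
    FULL = 0
    READ_ONLY = 1
    CACHED_ONLY = 2
    STATIC = 3
    DOWN = 4


RECOVERY_HOLD_SECONDS = 5 * 60  # hysteresis: signals must stay clean this long


class ModeController:
    """Degrades immediately, recovers one step at a time after a clean hold window."""

    def __init__(self, emit_event, incident_in_progress):
        self.emit_event = emit_event                      # assumed event-bus hook
        self.incident_in_progress = incident_in_progress  # assumed incident-tracker hook
        self.mode = ServiceMode.FULL
        self.mode_entered_at = time.time()
        self.clean_since = None  # when the recovery signal last went clean

    def _set_mode(self, new_mode, reason, now):
        # Both directions are instrumented: every change emits an event that
        # records how long the service spent in the previous mode.
        self.emit_event({"from": self.mode.name, "to": new_mode.name, "reason": reason,
                         "seconds_in_previous_mode": round(now - self.mode_entered_at, 1)})
        self.mode, self.mode_entered_at = new_mode, now

    def tick(self, target, now=None):
        """Called periodically with the target mode computed from the triggers."""
        now = now if now is not None else time.time()

        if target > self.mode:
            # Degrade immediately, no waiting.
            self._set_mode(target, "trigger fired", now)
            self.clean_since = None
        elif target < self.mode:
            if self.incident_in_progress():
                # Hold auto-recovery while the on-call is actively investigating.
                self.clean_since = None
            else:
                self.clean_since = self.clean_since or now
                if now - self.clean_since >= RECOVERY_HOLD_SECONDS:
                    # Step-wise: recover one mode, then restart the hold window.
                    self._set_mode(ServiceMode(self.mode - 1), "recovery hold elapsed", now)
                    self.clean_since = None
        else:
            self.clean_since = None

        return self.mode
```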
Graceful degradation done right preserves your SLO during incidents that would otherwise blow it. The user gets a degraded but useful experience; the team gets time to fix the root cause; the budget gets protected. Nova AI Ops watches SLO burn rate, triggers degradation modes when configured thresholds fire, and auto-recovers when the burn-rate signal stays clean through the hysteresis window so the service preserves itself without manual intervention.