The Degraded-Mode Runbook
When the system can't fully serve, what's the safe partial mode? This runbook defines it.
Define modes
Degraded modes are the discipline of choosing partial-but-safe over total failure. Each mode has an explicit feature scope so on-call can read the document at 3am and know exactly what is on, what is off, and what the user-visible difference is.
- Full mode. All features available. Default operational posture; what customers expect to see.
- Degraded mode. Read-only or feature-shedding state. Core function preserved under stress at the cost of writes or non-essential features.
- Minimal mode. Cached responses only. Last line before total failure; preserves the page-loads-but-actions-disabled experience.
- Documented mode definitions per service. Explicit per-mode feature list lives in the runbook. Answers the “what does degraded actually mean” question at 3am.
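One way to keep the per-mode feature list unambiguous is to encode it as data next to the runbook. The sketch below is illustrative only: the feature names (`reads`, `writes`, `search`, `recommendations`, `cached_reads`) and the helpers `is_enabled` and `disabled_in` are hypothetical, not taken from any particular service.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full"
    DEGRADED = "degraded"
    MINIMAL = "minimal"


# Hypothetical feature scope per mode; a real service would list its own features.
FEATURES_BY_MODE = {
    Mode.FULL: {"reads", "writes", "search", "recommendations"},
    Mode.DEGRADED: {"reads", "search"},   # read-only, non-essential features shed
    Mode.MINIMAL: {"cached_reads"},       # cached responses only
}


def is_enabled(mode: Mode, feature: str) -> bool:
    """Return True if the feature is on in the given mode."""
    return feature in FEATURES_BY_MODE[mode]


def disabled_in(mode: Mode) -> list[str]:
    """Features available in full mode that are off in the given mode."""
    return sorted(FEATURES_BY_MODE[Mode.FULL] - FEATURES_BY_MODE[mode])


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, "off:", disabled_in(mode))
```

With a table like this, "what is off in degraded mode" is a lookup rather than a judgment call.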
Triggers
Triggers are explicit and signal-driven. Auto-trigger where the math is clean; manual escalation where the signal is ambiguous and judgment is required.
- Specific signals push into each mode. Latency, error rate, downstream health. Named thresholds per mode.
- Auto-trigger where possible. Encoded automation removes human latency at the worst time. The mode shift does not have to wait for a page to fire first.
- Manual escalation if needed. IC-driven mode shift when the signal is ambiguous. Judgment over automation when the cost of a wrong call is high.
- Documented test per trigger. Game-day exercises catch stale or wrong thresholds before production does.
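The trigger rules above can be sketched as a small threshold table checked from most to least severe. Everything concrete here is an assumption for illustration: the signal names, the threshold values, and the rule that an unhealthy downstream alone pushes the service into degraded mode. Real thresholds come from the service's own history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    p99_latency_ms: float
    error_rate: float          # fraction of failed requests, 0.0 to 1.0
    downstream_healthy: bool


@dataclass(frozen=True)
class Threshold:
    p99_latency_ms: float
    error_rate: float


# Hypothetical named thresholds per mode, ordered most severe first.
THRESHOLDS = [
    ("minimal", Threshold(p99_latency_ms=5000, error_rate=0.25)),
    ("degraded", Threshold(p99_latency_ms=1500, error_rate=0.05)),
]


def target_mode(s: Signals) -> str:
    """Return the most severe mode whose thresholds the signals breach."""
    for mode, t in THRESHOLDS:
        if s.p99_latency_ms >= t.p99_latency_ms or s.error_rate >= t.error_rate:
            return mode
    # Assumed rule: an unhealthy downstream is enough to shed features.
    if not s.downstream_healthy:
        return "degraded"
    return "full"


if __name__ == "__main__":
    print(target_mode(Signals(p99_latency_ms=2000, error_rate=0.01, downstream_healthy=True)))
```

Because the function is pure, each trigger can carry a documented test: feed it a recorded or synthetic signal and assert the expected mode, which is the cheap version of a game-day check.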
Recover
Recovery is the matching discipline. Documented criteria, auto-recover where safe, human approval where recovery has its own risk.
- Documented recovery criteria. “When X is true for Y minutes, recover.” Removes guesswork at the moment of decision.
- Auto-recover where safe. Encoded recovery where stepping back up to full mode is reversible. Returns to full mode without paging.
- Human approval otherwise. Manual gate where recovery itself has risk. Better to wait for confirmation than to bounce back into the same incident.
- Post-recovery monitor. Extended watch after recovery catches mode-flapping early. The first ten minutes after recovery are still incident time.
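The “when X is true for Y minutes, recover” rule, plus the optional human gate, can be sketched as a small state holder. This is a minimal sketch under stated assumptions: `RecoveryGate`, its fields, and the 300-second hold used in the usage lines are hypothetical, and timestamps are passed in explicitly so the logic can be tested without a real clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryGate:
    hold_seconds: float                    # Y: how long the criteria must hold
    requires_approval: bool = False        # True where recovery itself has risk
    approved: bool = False                 # set by a human when approval is required
    healthy_since: Optional[float] = None  # start of the current healthy streak

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health sample; return True when recovery may proceed."""
        if not healthy:
            # Any unhealthy sample resets the streak, which damps mode-flapping.
            self.healthy_since = None
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        held = (now - self.healthy_since) >= self.hold_seconds
        return held and (self.approved or not self.requires_approval)


if __name__ == "__main__":
    gate = RecoveryGate(hold_seconds=300)
    for t, ok in [(0, True), (200, True), (250, False), (300, True), (600, True)]:
        print(t, gate.observe(ok, now=t))
```

The same streak logic can keep running after the return to full mode as the post-recovery monitor: an unhealthy sample in the watch window is a signal to drop back down rather than treat the incident as closed.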