The Degraded-Mode Runbook

When the system can't fully serve, what is the safe partial mode? This runbook defines it.

Define modes

Degraded modes are the discipline of choosing partial-but-safe over total failure. Each mode has an explicit feature scope so on-call can read the document at 3am and know exactly what is on, what is off, and what the user-visible difference is.

Full mode. All features available. Default operational posture; what customers expect to see.
Degraded mode. Read-only or feature-shedding state. Core function preserved under stress at the cost of writes or non-essential features.
Minimal mode. Cached responses only. Last line before total failure; preserves the page-loads-but-actions-disabled experience.
Documented mode definitions per service. An explicit per-mode feature list lives in the runbook. Answers the “what does degraded actually mean” question at 3am.
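
A per-mode feature list can also live in code next to the runbook, so the two stay in sync. The sketch below is one possible shape, not a prescribed implementation: the feature names (read, write, search, recommendations, cached_read) and the helpers is_enabled and disabled_in are hypothetical and stand in for whatever a real service exposes.

```python
from enum import Enum


class Mode(Enum):
    FULL = "full"
    DEGRADED = "degraded"
    MINIMAL = "minimal"


# Hypothetical feature names for an illustrative service.
# Each mode lists exactly what is ON; everything else is OFF.
FEATURES_BY_MODE = {
    Mode.FULL: frozenset({"read", "write", "search", "recommendations"}),
    Mode.DEGRADED: frozenset({"read", "search"}),  # read-only, extras shed
    Mode.MINIMAL: frozenset({"cached_read"}),      # cached responses only
}


def is_enabled(mode, feature):
    """Return True if `feature` is on in `mode`."""
    return feature in FEATURES_BY_MODE[mode]


def disabled_in(mode):
    """Features available in full mode that are off in `mode`, sorted by name."""
    return sorted(FEATURES_BY_MODE[Mode.FULL] - FEATURES_BY_MODE[mode])


if __name__ == "__main__":
    for mode in Mode:
        print(mode.value, "on:", sorted(FEATURES_BY_MODE[mode]), "off:", disabled_in(mode))
```

Listing what is on per mode, and deriving what is off from full mode, keeps the user-visible difference in one place for on-call to read.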

Triggers

Triggers are explicit and signal-driven. Auto-trigger where the math is clean; manual escalation where the signal is ambiguous and judgment is required.

Specific signals push the service into each mode. Latency, error rate, downstream health. Named thresholds per mode.
Auto-trigger where possible. Encoded automation removes human latency at the worst time. The mode shift does not wait for a page to fire and a human to respond.
Manual escalation if needed. IC-driven mode shift when the signal is ambiguous. Judgment over automation when the cost of a wrong call is high.
Documented test per trigger. Game-day exercises catch stale or wrong thresholds before production does.
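
Named thresholds per mode can be encoded as a small pure function, which also makes them easy to exercise in a game day. This is a minimal sketch under stated assumptions: the threshold values, the Signals fields, and the manual_override parameter (standing in for an IC-driven shift) are all hypothetical, not taken from any real service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    p99_latency_ms: float
    error_rate: float          # fraction of failing requests, 0.0 to 1.0
    downstream_healthy: bool


@dataclass(frozen=True)
class Thresholds:
    p99_latency_ms: float
    error_rate: float


# Hypothetical thresholds; real values come from the service's own SLOs.
DEGRADED_AT = Thresholds(p99_latency_ms=800.0, error_rate=0.05)
MINIMAL_AT = Thresholds(p99_latency_ms=3000.0, error_rate=0.25)


def breaches(signals: Signals, limit: Thresholds) -> bool:
    """True if latency or error rate meets or exceeds the limit."""
    return (signals.p99_latency_ms >= limit.p99_latency_ms
            or signals.error_rate >= limit.error_rate)


def target_mode(signals: Signals, manual_override: Optional[str] = None) -> str:
    """Pick a mode from signals; an explicit IC override always wins."""
    if manual_override is not None:
        return manual_override
    if breaches(signals, MINIMAL_AT):
        return "minimal"
    if breaches(signals, DEGRADED_AT) or not signals.downstream_healthy:
        return "degraded"
    return "full"
```

A game-day test per trigger then reduces to feeding synthetic Signals values at and around each threshold and checking the returned mode.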

Recover

Recovery is the matching discipline. Documented criteria, auto-recover where safe, human approval where recovery has its own risk.

Documented recovery criteria. “When X is true for Y minutes, recover.” Removes guesswork at the moment of decision.
Auto-recover where safe. Encoded recovery on paths where the step back up is easily reversed. Returns to full mode without paging.
Human approval otherwise. Manual gate where recovery itself has risk. Better to wait for confirmation than to bounce back into the same incident.
Post-recovery monitor. Extended watch after recovery catches mode-flapping early. The first ten minutes after recovery are still incident time.
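
The “when X is true for Y minutes, recover” rule, with an optional human gate, can be sketched as a small state holder. This is one possible shape under stated assumptions: the RecoveryGate class, its hold_seconds and needs_approval parameters, and the approved flag are hypothetical names. Timestamps are passed in explicitly so the logic can be tested without a real clock.

```python
from typing import Optional


class RecoveryGate:
    """Allow recovery only after health has held continuously for hold_seconds."""

    def __init__(self, hold_seconds: float, needs_approval: bool = False):
        self.hold_seconds = hold_seconds
        self.needs_approval = needs_approval
        self.approved = False  # set to True by a human when approval is required
        self._healthy_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one health sample; return True if recovery may proceed."""
        if not healthy:
            # Any unhealthy sample resets the hold window.
            self._healthy_since = None
            return False
        if self._healthy_since is None:
            self._healthy_since = now
        held = (now - self._healthy_since) >= self.hold_seconds
        return held and (self.approved or not self.needs_approval)


if __name__ == "__main__":
    gate = RecoveryGate(hold_seconds=300)
    for t, ok in [(0, True), (120, True), (150, False), (200, True), (500, True)]:
        print(t, ok, gate.observe(ok, now=t))
```

Resetting the window on any unhealthy sample is a deliberate choice here: it trades slower recovery for less mode-flapping, which matches the intent of the post-recovery watch.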