The Degraded-Mode Recovery Runbook

Recovering from degraded mode needs its own runbook. The steps below are the ones that prevent re-degradation.

Verify root cause fixed

The first recovery step is verification, not restoration. Recovery on top of an unfixed root cause re-fails immediately and burns customer trust twice in one incident. Strict gate: do not start the recovery sequence until a named engineer confirms the cause is actually fixed.

Strict gate: cause fixed before recovery. Recovery before fix produces false hope and a second incident inside the first. Hold the line.
Mandatory verification step. Documented "we confirmed X is no longer true" check. Catches premature recovery attempts.
Named verifier per incident. Responsible engineer signs off. Catches "I assumed someone else checked" gaps.
Verification commands in the runbook. Explicit "run X and expect Y" steps. Supports new responders who do not yet know the system intuitively.
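
The gate above can be sketched in code. This is a minimal illustration, not a definitive implementation: `VerificationCheck`, `GateResult`, and `verify_root_cause_fixed` are hypothetical names, and each `run` callable stands in for whatever "run X and expect Y" command the runbook documents.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VerificationCheck:
    """One "run X and expect Y" step from the runbook."""
    description: str
    run: Callable[[], str]  # hypothetical probe that returns the observed value
    expected: str


@dataclass
class GateResult:
    verifier: str
    failures: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The gate passes only when nothing was recorded as a failure.
        return not self.failures


def verify_root_cause_fixed(checks: List[VerificationCheck], verifier: str) -> GateResult:
    """Pass only if a verifier is named and every check observes its expected value."""
    result = GateResult(verifier=verifier.strip())
    if not result.verifier:
        result.failures.append("no named verifier")
    if not checks:
        result.failures.append("no verification checks defined")
    for check in checks:
        observed = check.run()
        if observed != check.expected:
            result.failures.append(
                f"{check.description}: expected {check.expected!r}, observed {observed!r}"
            )
    return result
```

In this sketch the recovery sequence would start only when `passed` is true. A non-empty `failures` list names exactly what still blocks recovery.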

Staged recovery

Recovery is staged, not all-at-once. Restore one feature, watch metrics for a defined window, then promote the next. Staging catches partial failures: features that come back broken even after the underlying cause is fixed.

One feature at a time. Restore a single feature per stage. Each stage is validated before the next starts.
Observation window between stages. Watch metrics for the documented window before promoting. Catches second-order failures that surface only after a feature has been running for a while.
Detect partial-failure modes. Some features come back wrong even with the cause fixed. Staging surfaces them while the blast radius is still small.
Rollback option per stage. Documented retreat for every stage. Counters the "we are committed now" pressure to push through visible regressions.
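
The staged sequence can be sketched as a loop. This is an illustration under stated assumptions: `Stage`, `RecoveryOutcome`, and `run_staged_recovery` are hypothetical names, the `healthy` callable stands in for whatever metrics check the team actually uses, and the window and poll interval come from the runbook rather than from this code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    feature: str
    restore: Callable[[], None]
    rollback: Callable[[], None]   # documented retreat for this stage
    healthy: Callable[[], bool]    # hypothetical metrics check


@dataclass
class RecoveryOutcome:
    restored: List[str] = field(default_factory=list)
    rolled_back: Optional[str] = None


def run_staged_recovery(stages, window_seconds, poll_seconds, sleep=time.sleep):
    """Restore one feature at a time; roll back and stop at the first regression."""
    if poll_seconds <= 0:
        raise ValueError("poll_seconds must be positive")
    outcome = RecoveryOutcome()
    for stage in stages:
        stage.restore()
        elapsed = 0.0
        # Observation window: keep checking until the window has fully elapsed.
        while True:
            if not stage.healthy():
                stage.rollback()
                outcome.rolled_back = stage.feature
                return outcome  # later stages are never started
            if elapsed >= window_seconds:
                break
            sleep(poll_seconds)
            elapsed += poll_seconds
        outcome.restored.append(stage.feature)
    return outcome
```

Stopping at the first unhealthy stage keeps the blast radius to one feature. Stages that already passed their window stay restored, and only the failing stage is rolled back.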

Comms

Comms during recovery mirror the staged restoration. Customers see incremental improvement rather than a single all-clear that they cannot verify against their own experience. That means per-stage status updates, an explicit final all-clear, and a named comms author for continuity through long recoveries.

Status update per recovery stage. Each stage gets a published progress note. Customers see incremental improvement.
Visible feature restoration. Each update names the feature now restored. Builds customer confidence stage by stage.
Final all-clear message. Explicit completion entry once every stage is verified. Removes lingering customer uncertainty about whether the incident is really over.
Named comms author per recovery. Responsible writer for continuity. Long recoveries especially need one voice.
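
The per-stage comms pattern can be sketched as a small helper that drafts each note. This is illustrative only: `recovery_status_note` is a hypothetical function, the wording is a placeholder for the team's own status-page templates, and the feature names are supplied by the caller.

```python
from typing import Sequence


def recovery_status_note(author: str, restored: Sequence[str], remaining: Sequence[str]) -> str:
    """Draft one status note: a per-stage update, or the final all-clear."""
    author = author.strip()
    if not author:
        raise ValueError("a named comms author is required")
    if not restored and not remaining:
        raise ValueError("no features listed for this recovery")
    if not remaining:
        # Every stage is verified: publish the explicit all-clear.
        return (
            f"All clear: {', '.join(restored)} restored and verified. "
            f"Recovery is complete. ({author})"
        )
    if not restored:
        return f"Recovery starting. Still degraded: {', '.join(remaining)}. ({author})"
    # Name the feature restored in this stage so customers can check it themselves.
    return (
        f"Recovery update: {restored[-1]} is now restored. "
        f"Restored so far: {', '.join(restored)}. "
        f"Still degraded: {', '.join(remaining)}. ({author})"
    )
```

Requiring the author up front keeps one voice on every note. The all-clear is produced only when nothing remains degraded.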