The Incident Rehearsal Quarterly Cadence
Practice your worst case quarterly. This piece covers the format, the difficulty curve, and the gaps the drill surfaces in runbooks and team coordination.
Format
The format is structured: a 2-hour block, pre-announced (the team knows it is coming but not the scenario), run on real tools and channels, with a senior engineer acting as incident commander (IC) and injecting events. Each rehearsal has documented learning goals, so the drill tests something specific rather than running through general motions.
- 2-hour block. Bounded time per rehearsal. Pre-announced so the team blocks calendars but does not pre-game the scenario.
- Senior engineer as IC and facilitator. Injects scenarios, observes responses, calls timeouts when needed. Holds the same authority as in a real incident.
- Real tools and channels. Same Slack, dashboards, paging, escalation paths as production response. Reveals tool-chain gaps that simulators hide.
- Documented learning goals. Named "what we are testing" objectives per rehearsal. Prevents aimless drills that produce no findings.
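The format rules above can be checked mechanically before a drill goes on the calendar. The following is a minimal sketch, not a prescribed tool: the `RehearsalPlan` class, its field names, and the wording of the messages are all hypothetical, chosen only to mirror the four bullets.

```python
from dataclasses import dataclass
from datetime import timedelta

MAX_BLOCK = timedelta(hours=2)  # the bounded time per rehearsal


@dataclass
class RehearsalPlan:
    """Hypothetical record of one planned rehearsal."""
    scenario: str                  # known only to the facilitator
    facilitator: str               # senior engineer acting as IC
    learning_goals: list[str]      # the "what we are testing" objectives
    duration: timedelta = MAX_BLOCK
    uses_production_tooling: bool = True

    def problems(self) -> list[str]:
        """Return the format rules this plan breaks (empty if none)."""
        issues = []
        if not self.learning_goals:
            issues.append("no documented learning goals")
        if self.duration > MAX_BLOCK:
            issues.append("exceeds the 2-hour block")
        if not self.uses_production_tooling:
            issues.append("not using real tools and channels")
        return issues

    def announcement(self) -> str:
        """Pre-announce the block without revealing the scenario."""
        hours = self.duration.total_seconds() / 3600
        return (
            f"Incident rehearsal: {hours:g}-hour block, "
            f"facilitated by {self.facilitator}. "
            "Scenario disclosed at start."
        )
```

A plan with an empty `learning_goals` list would be flagged by `problems()`, and `announcement()` deliberately omits the scenario so the team can block calendars without pre-gaming it.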
Scenario examples
Scenarios stretch different muscles. Database failover during peak stresses the cutover sequence; multi-region outage stresses cross-region coordination and IC handoff; vendor API down stresses graceful degradation and customer comms. Rotate scenarios so the same drill does not run twice.
- Database failover during peak. Tests the failover playbook under realistic load. Stresses the cutover sequence.
- Multi-region outage. Tests cross-region coordination, IC handoff, shared comms. Stresses team-of-teams response.
- Vendor API down. Tests degradation paths and customer comms. Stresses graceful-failure muscle.
- New scenario per quarter. Each drill takes a shape the team has not yet rehearsed. Catches "we always do the same one" drift.
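The rotation rule can be kept honest with a small helper that compares a scenario catalog against the drill history. This is an illustrative sketch only; the function name `next_scenario` and the plain-string scenario labels are assumptions, not part of any existing tool.

```python
from typing import Optional


def next_scenario(catalog: list[str], history: list[str]) -> Optional[str]:
    """Return the first catalog scenario that has never been rehearsed.

    Returns None when every scenario has already run, which signals
    that a new scenario needs to be written for the coming quarter.
    """
    rehearsed = set(history)
    for scenario in catalog:
        if scenario not in rehearsed:
            return scenario
    return None


# Hypothetical catalog, using the scenario names from this section.
catalog = [
    "database failover during peak",
    "multi-region outage",
    "vendor API down",
]
history = ["database failover during peak"]
upcoming = next_scenario(catalog, history)
```

Returning `None` rather than cycling back to the start is a deliberate choice here: an exhausted catalog is treated as a prompt to design a new drill, not as permission to repeat an old one.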
Output
The output is concrete: gap list, action items with named owners and target dates, drift caught before real incidents hit. Published findings spread the lessons across teams that did not attend the specific drill.
- Gap list per rehearsal. A catalog of unclear runbooks, broken tools, and broken processes. Concrete findings, not vague impressions.
- Action items with owners and dates. Named owner and fix-by date per gap. Tracked alongside real-incident action items.
- Drift caught early. Real incidents go better because the drill surfaced the issues first. Pre-emptive maintenance for incident response.
- Published findings. Team-wide write-up. Cross-team learning compounds even when only one team ran the drill.
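One way to picture action items tracked alongside real-incident items is a single record type with a source label and a fix-by date. The sketch below is a simplified assumption: `ActionItem`, its fields, and the `overdue` helper are hypothetical, and a real team would more likely keep these in its existing ticket tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical record for one gap found in a drill or incident."""
    gap: str          # concrete finding, e.g. an unclear runbook step
    owner: str        # named owner, not a team alias
    due: date         # fix-by date
    source: str       # "rehearsal" or "incident"
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose fix-by date has passed."""
    return [item for item in items if not item.done and item.due < today]


# Illustrative data only; names and dates are made up.
items = [
    ActionItem("failover runbook skips DNS step", "alice", date(2024, 3, 1), "rehearsal"),
    ActionItem("paging rule missed secondary on-call", "bob", date(2024, 4, 15), "incident"),
    ActionItem("status page template outdated", "carol", date(2024, 2, 1), "rehearsal", done=True),
]
late = overdue(items, today=date(2024, 3, 10))
```

Keeping rehearsal and incident items in one list means a single overdue check covers both, so drill findings are not quietly deprioritized.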