On-Call Game-Day Rehearsals: Practice for Real Incidents
Game days are how teams stay sharp between real incidents. Cadence and structure determine whether they pay off.
Why game days
Real incidents are infrequent (good); on-call muscle atrophies between them (bad). Game days build the muscle that real incidents need.
- Skill maintenance. The team's response speed depends on practice; without it, response time during real incidents grows.
- Runbook validation. Game days surface stale runbooks before the on-call discovers them at 3am.
- Onboarding tool. New engineers learn the system by participating; reading runbooks alone is not enough.
- Confidence building. Successful drills build the team's confidence to act decisively when the real incident hits.
Four tiers, four frequencies
- Tabletop: walk through a scenario; quarterly.
- Limited blast radius: staging environment; monthly.
- Production game day: controlled production failure; semi-annually.
- Surprise drill: unannounced test; annually.
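The cadence above can be checked mechanically. The following is a minimal sketch, assuming each frequency is approximated as a fixed number of days and that the team records the date of the last drill per tier. The `Tier` class, the `overdue_tiers` function, and the day counts are illustrative choices, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    interval_days: int  # approximate cadence in days (an assumption)


# Cadences from the list above, approximated in days.
TIERS = [
    Tier("tabletop", 91),              # quarterly
    Tier("limited blast radius", 30),  # monthly
    Tier("production game day", 182),  # semi-annually
    Tier("surprise drill", 365),       # annually
]


def overdue_tiers(last_run: dict[str, date], today: date) -> list[str]:
    """Return the names of tiers whose last drill is older than the interval.

    A tier with no recorded drill counts as overdue.
    """
    overdue = []
    for tier in TIERS:
        last = last_run.get(tier.name)
        if last is None or today - last > timedelta(days=tier.interval_days):
            overdue.append(tier.name)
    return overdue
```

Calling `overdue_tiers` with the team's drill history at the start of each rotation gives a short list of tiers that need scheduling; the exact day counts matter less than running the check regularly.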
Scenario library
Game days work better with a curated library. Ten to fifteen scenarios rotated through the year cover most of the surface without becoming repetitive.
- Library size. Keep 10-15 scenarios in rotation: a smaller library goes stale, and a larger one means the team never repeats a scenario.
- Past-incident match. Scenarios mirror common past incidents; the team builds confidence on familiar territory.
- Difficulty range. Mix easy, medium, and hard scenarios; not every drill is a stress test.
- Annual refresh. Retire scenarios that no longer apply; add new ones from the last year's postmortems.
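One way to rotate through the library is to pick the least recently run scenario at a chosen difficulty. This is a sketch under assumptions: the `Scenario` fields, the three difficulty labels as strings, and the function names are hypothetical, and a real library would likely live in a wiki or tracker rather than in code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Scenario:
    title: str
    difficulty: str                  # "easy", "medium", or "hard"
    last_run: Optional[date] = None  # None means the scenario has never run


def next_scenario(library: list[Scenario], difficulty: str) -> Scenario:
    """Pick the least recently run scenario at the requested difficulty.

    Scenarios that have never run are picked first.
    """
    candidates = [s for s in library if s.difficulty == difficulty]
    if not candidates:
        raise ValueError(f"no {difficulty} scenarios in the library")
    return min(
        candidates,
        key=lambda s: (s.last_run is not None, s.last_run or date.min),
    )


def library_size_ok(library: list[Scenario]) -> bool:
    """Check the 10-15 scenario guideline from the text."""
    return 10 <= len(library) <= 15
```

Alternating the `difficulty` argument between drills keeps the mix of easy, medium, and hard scenarios, while the least-recently-run rule ensures every scenario in the library comes up before any repeats.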
Action items
The output of every game day is a list of changes that ship. Without follow-through, the drill is theatre.
- Per-drill output. Two to five action items per drill, such as a documentation update, a runbook fix, or a tooling improvement.
- Owner and date. Each item has a named owner and a target date; tracked alongside feature work.
- Closure tracked. Each game day opens with a review of the previous drill's action-item list; momentum compounds.
- Public visibility. Action items are shared at the engineering all-hands, so the team sees that the practice produces real change.
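The rules above (two to five items, each with a named owner and a target date, open items reviewed at the next drill) can be expressed as a small check. This is an illustrative sketch only: `ActionItem`, `validate_drill_output`, and `carry_over` are hypothetical names, and most teams would track these fields in their existing issue tracker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    summary: str
    owner: str
    due: date
    closed: bool = False


def validate_drill_output(items: list[ActionItem]) -> list[str]:
    """Return a list of problems with a drill's action items; empty means OK."""
    problems = []
    if not 2 <= len(items) <= 5:
        problems.append(f"expected 2 to 5 action items, got {len(items)}")
    for item in items:
        if not item.owner.strip():
            problems.append(f"no owner: {item.summary}")
    return problems


def carry_over(previous: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items from the previous drill, overdue ones first, then by due date."""
    open_items = [i for i in previous if not i.closed]
    return sorted(open_items, key=lambda i: (i.due >= today, i.due))
```

Running `carry_over` on the previous drill's list at the start of the next game day gives the review agenda, with overdue items at the top.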
Antipatterns
- Game day without action items. Theatre.
- Production game day without rehearsal at lower tiers. High risk.
- Only one game day a year. Skills and memory fade between drills.
What to do this week
Three moves. (1) Schedule a game day for your next on-call rotation. (2) Survey the team after one cycle. (3) Iterate based on the feedback; the discipline is in the cadence.