The On-Call Cool-Down Period
After a major incident, the on-call takes a cool-down: a short, mandatory rest that reduces secondary errors.
The cool-down protocol
The cool-down protocol is mandatory rest after major incidents: 30-60 minutes of explicit recovery before resuming normal work. Rest reduces secondary errors because a tired on-call making decisions immediately after a sev 1 is at higher risk of compounding the incident. Backup on-call covers the cool-down window.
- Mandatory rest. 30-60 minutes of explicit recovery after major incidents; not optional.
- Reduces secondary errors. A tired on-call making decisions immediately after a sev 1 is more likely to compound the incident.
- Backup covers. Not a vacation; an explicit handoff for a bounded window.
- Per-incident protocol invocation. The cool-down is a named protocol with a named trigger, not a vague suggestion.
When to invoke
The cool-down has predictable triggers: always after a sev 1, after a sev 2 that runs longer than 4 hours, and after a string of consecutive sev 3 incidents, because cumulative load is itself a source of fatigue. The trigger criteria are documented so invocation is automatic, not a debate.
- Sev 1. Always; the highest-stakes incidents drain the on-call regardless of duration.
- Sev 2 over 4 hours. Long durations are draining even at lower severity.
- Consecutive sev 3. Multiple in a shift; the cumulative load is the issue, not any single page.
- Per-trigger documented criteria. The trigger is committed to the runbook; invocation is automatic, not a debate.
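As an illustration, the trigger criteria above could be encoded as a small check so that invocation really is automatic. This is a minimal sketch, not a prescribed implementation: the `Incident` type, the `cooldown_required` function, and the reading of "multiple in a shift" as two or more sev 3 incidents are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of one incident handled during a shift."""
    severity: int        # 1 = sev 1, 2 = sev 2, 3 = sev 3
    duration: timedelta


SEV2_LONG = timedelta(hours=4)  # "Sev 2 over 4 hours" from the criteria above
SEV3_THRESHOLD = 2              # assumption: "multiple in a shift" means 2 or more


def cooldown_required(shift_incidents: list[Incident]) -> bool:
    """Return True if the incidents handled this shift meet any trigger."""
    sev3_count = 0
    for incident in shift_incidents:
        if incident.severity == 1:
            return True  # sev 1 always triggers, regardless of duration
        if incident.severity == 2 and incident.duration > SEV2_LONG:
            return True  # long sev 2
        if incident.severity == 3:
            sev3_count += 1
    # Cumulative load: several sev 3 pages in one shift
    return sev3_count >= SEV3_THRESHOLD
```

A team would adjust `SEV3_THRESHOLD` to whatever its runbook commits to; the point is that the answer comes from documented criteria rather than from a judgment call made by a tired engineer.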
How long
Cool-down length scales with severity. The minimum is 30 minutes; it runs longer for multi-hour sev 1 incidents or customer-facing data issues, and up to 2 hours when leadership coordination was required. Catastrophic incidents (data loss, major outages, security events) get a half-day, because the recovery is real and shortcutting it produces secondary incidents.
- 30 minutes minimum. The base recovery window; longer for severe incidents.
- Up to 2 hours. Incidents that involved customer impact or leadership coordination.
- Half-day for catastrophic. Data loss, major outages, security events; the recovery is real.
- Per-severity duration table. The duration mapping is documented, which supports consistent invocation across the team.
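The duration mapping above can be sketched as a lookup function. The function name, its parameters, and the choice of 4 hours for "half-day" are assumptions for illustration. The source gives 2 hours as an upper bound ("up to"), so the sketch returns that ceiling rather than a required length.

```python
from datetime import timedelta

BASE = timedelta(minutes=30)    # minimum recovery window
EXTENDED = timedelta(hours=2)   # upper bound for customer impact or leadership coordination
HALF_DAY = timedelta(hours=4)   # assumption: "half-day" taken as 4 hours


def cooldown_length(*, catastrophic: bool = False,
                    customer_impact: bool = False,
                    leadership_involved: bool = False) -> timedelta:
    """Return the cool-down window for an incident with the given traits."""
    if catastrophic:
        # Data loss, major outages, security events
        return HALF_DAY
    if customer_impact or leadership_involved:
        return EXTENDED
    return BASE
```

For example, `cooldown_length(leadership_involved=True)` yields a 2-hour ceiling, while a call with no flags set yields the 30-minute minimum.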
What to do during cool-down
Cool-down is recovery, not reduced-intensity work. Step away from the keyboard and walk, eat, or rest: anything except continued incident work. A brief debrief with the team is acceptable if it helps process the experience. The first draft of the postmortem can wait until the on-call is rested.
- Step away from the keyboard. Walk, eat, rest; anything except continued incident work.
- Brief debrief acceptable. Process the experience with teammates if it helps; do not turn it into work.
- Defer the postmortem draft. The on-call writes the timeline later; the analytical work waits until they’re rested.
- Per-cool-down activity guidance. Documented so the on-call doesn’t fall into work by reflex.
Making it stick
Cool-downs only stick if culture and tracking enforce them. Three things do that: manager enforcement (engineers self-impose poorly), a public norm (announcing the cool-down removes stigma), and tracking (cool-downs that aren’t taken get flagged, because powering through is risk, not heroism).
- Manager enforcement. Engineers self-impose poorly; managers must require the rest.
- Public norm. The on-call announces the cool-down to the team ("I’m cooling down for an hour after that sev 1"), which removes stigma.
- Track usage. Cool-downs that aren’t taken get flagged; powering through is risk, not heroism.
- Per-team cool-down audit. A quarterly review of cool-down adherence reinforces the culture.
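The tracking and audit steps above could be supported by a small adherence report. This is a sketch under stated assumptions: `CooldownRecord`, its fields, and `audit_cooldowns` are hypothetical names, and the records would come from whatever incident tracker the team already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CooldownRecord:
    """Hypothetical per-incident record kept for the quarterly review."""
    incident_id: str
    required: bool   # did the incident meet a documented trigger?
    taken: bool      # did the on-call actually take the cool-down?


def audit_cooldowns(records: list[CooldownRecord]) -> tuple[float, list[str]]:
    """Return (adherence rate, incident ids where a required cool-down was skipped)."""
    required = [r for r in records if r.required]
    skipped = [r.incident_id for r in required if not r.taken]
    if not required:
        return 1.0, []  # nothing was required, so nothing was skipped
    rate = (len(required) - len(skipped)) / len(required)
    return rate, skipped
```

The skipped list is the part worth reviewing with managers: each entry is a case where someone powered through, which the protocol treats as risk rather than heroism.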