Recovering From a Saturated On-Call
When the on-call has been pinned for 3+ days, normal recovery does not work. What follows is a 5-step protocol for getting the team back to baseline.
The 5 steps
Step 1: pause non-critical work. Marketing campaigns, feature launches, anything that adds complexity. Buys cognitive room.
Step 2: bring in additional on-call from another team for 48 hours. Lets the burned-out engineer actually rest.
Step 3: triage the incident backlog. Some items become 'will not fix'; the rest get owners.
Step 4: identify the burning fire. The single thing causing repeat alerts. Fix it before resuming normal cadence.
Step 5: resume normal operation only after a full quiet shift. Premature resumption produces relapse.
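Steps 4 and 5 can be checked against alert history rather than by gut feel. The following is a minimal sketch, not a definitive implementation. It assumes alerts are available as dictionaries with a "name" and a "fired_at" timestamp, that the "burning fire" is approximated by the most frequently firing alert, and that a shift lasts 12 hours. The field names, the function names, and the shift length are all hypothetical; adapt them to whatever your paging tool exports.

```python
from collections import Counter
from datetime import datetime, timedelta


def burning_fire(alerts):
    """Return (alert_name, count) for the most frequent alert, or None if there are no alerts.

    Assumption: the most frequent alert name is a reasonable first guess
    at the single thing causing repeat alerts (Step 4).
    """
    counts = Counter(a["name"] for a in alerts)
    if not counts:
        return None
    return counts.most_common(1)[0]


def is_quiet_shift(alerts, shift_start, shift_hours=12):
    """Return True if no alert fired during the shift window (Step 5).

    Assumption: a shift is shift_hours long, starting at shift_start.
    """
    shift_end = shift_start + timedelta(hours=shift_hours)
    return not any(shift_start <= a["fired_at"] < shift_end for a in alerts)


# Hypothetical sample data for illustration only.
t0 = datetime(2024, 3, 4, 8, 0)
sample_alerts = [
    {"name": "disk-full", "fired_at": t0 + timedelta(hours=1)},
    {"name": "queue-lag", "fired_at": t0 + timedelta(hours=2)},
    {"name": "disk-full", "fired_at": t0 + timedelta(hours=5)},
    {"name": "disk-full", "fired_at": t0 + timedelta(hours=9)},
]

print(burning_fire(sample_alerts))
print(is_quiet_shift(sample_alerts, t0))
print(is_quiet_shift(sample_alerts, t0 + timedelta(hours=12)))
```

Frequency is only a proxy. The top alert may be a symptom of something upstream, so treat the result as a starting point for Step 4, not a verdict.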
Signs you need this protocol
Three or more sleepless nights in a week. Not just busy; actually unable to sleep through the night.
The on-call is making mistakes that a version of them with 24 hours of rest would not.
Stakeholders questioning team capability. Burned-out teams produce visible quality drops.
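The first sign is the easiest to measure from paging data. The sketch below counts nights with at least one page and compares the count to the three-night threshold. It rests on assumptions not stated above: a "sleepless night" is approximated as any night with a page between 23:00 and 06:00, the caller passes in one week of page timestamps, and the function names are hypothetical. A single page does not always mean a lost night, so read the result as a prompt to ask the on-call, not as a diagnosis.

```python
from datetime import datetime, timedelta


def interrupted_nights(page_times, night_start_hour=23, night_end_hour=6):
    """Count distinct nights with at least one page during sleeping hours.

    Assumption: a night runs from night_start_hour to night_end_hour the
    next morning, and is identified by the date on which it started.
    """
    nights = set()
    for t in page_times:
        if t.hour >= night_start_hour:
            # Late-evening page: the night starts on this date.
            nights.add(t.date())
        elif t.hour < night_end_hour:
            # Early-morning page: the night started the previous date.
            nights.add((t - timedelta(days=1)).date())
    return len(nights)


def needs_protocol(page_times, threshold=3):
    """Return True when interrupted nights reach the threshold (three by default)."""
    return interrupted_nights(page_times) >= threshold


# Hypothetical week of page timestamps for illustration only.
week_pages = [
    datetime(2024, 3, 4, 23, 30),  # night of Mar 4
    datetime(2024, 3, 5, 2, 0),    # still the night of Mar 4
    datetime(2024, 3, 6, 3, 15),   # night of Mar 5
    datetime(2024, 3, 7, 14, 0),   # daytime, not counted
    datetime(2024, 3, 8, 1, 0),    # night of Mar 7
]

print(interrupted_nights(week_pages))
print(needs_protocol(week_pages))
```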
Avoid
Heroism: 'I can power through.' Powering through is how saturated on-calls become quitting on-calls.
Pretending it is normal. The saturation is data; act on it.
Blame. Saturation usually has a system cause; finding it is more useful than finding fault.