The On-Call Rotation Playbook for Teams of 5-50 Engineers
On-call is the fastest way to burn out a platform team. Here is how to run a rotation that people don't dread.
Shape of a healthy rotation
A healthy rotation is predictable, paired, short, and paid. Unpredictable shifts kill planning. Unpaired shifts burn the primary. Long shifts destroy sleep. Unpaid shifts breed quiet resentment.
If any one of the four is missing, the rotation degrades within a quarter, usually faster than anyone says out loud.
Shift length and primary/secondary
One week is the default. Shorter shifts create too much handoff overhead. Longer shifts cause cumulative sleep debt.
- Primary: gets paged first, owns the incident end to end.
- Secondary: the backup and a mental-health safety net. Their job is to take over if the primary gets back-to-back pages.
For teams under 6 engineers, rotate every two weeks so nobody is back on primary within the same calendar month.
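The primary/secondary pairing above can be sketched as a simple round-robin schedule. This is a minimal illustration, not a prescribed tool: the function name build_rotation, its parameters, and the choice to make each shift's secondary the next shift's primary are all assumptions made for the example.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, total_weeks, shift_weeks=1):
    """Return a list of (shift_start, primary, secondary) tuples.

    Assumption for this sketch: the secondary is the next person in
    the list, so they shadow one shift before taking primary.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to pair primary and secondary")
    if shift_weeks < 1:
        raise ValueError("shift_weeks must be at least 1")

    schedule = []
    n_shifts = total_weeks // shift_weeks
    for i in range(n_shifts):
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        shift_start = start + timedelta(weeks=i * shift_weeks)
        schedule.append((shift_start, primary, secondary))
    return schedule
```

Calling build_rotation(team, date(2024, 1, 1), total_weeks=12) yields twelve one-week shifts. Passing shift_weeks=2 gives the two-week variant. Other pairings, such as making the previous primary the secondary so incident context carries over, are equally reasonable.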
The 10-minute handoff
At shift end, the outgoing primary runs a 10-minute sync with the incoming primary. Template:
- What broke this week (two sentences per incident)
- What is still fragile (monitoring gaps, pending rollbacks, known flaky alerts)
- What would be good to know in the next 24h (planned deploys, traffic events)
Record it. Async follow-ups are fine; the live sync is what makes handoffs feel owned rather than thrown over a wall.
Compensation isn't optional
Some combination of: an extra PTO day per week of on-call, a 10-15% pay differential for the week, or automatic time off the day after a busy night. Pick one and commit. The exact mechanism matters less than that it exists.
Teams that run on-call without compensation lose their best senior engineers first. They are the ones with enough leverage to leave.
Two metrics to watch
Pages per primary per week, and nights interrupted (any page between 11pm and 7am) per month. If either crosses its threshold for two consecutive periods (weeks for pages, months for interrupted nights), something has to give.
Reasonable thresholds for a mature team: under 5 pages/primary/week, under 2 interrupted nights/primary/month. If your numbers are above these, the fix is tuning alerts and SLOs, not adding more people to the rotation.
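Both metrics can be computed from a plain page log. The sketch below assumes each page is a (primary, timestamp) pair in the primary's local time. The function names are hypothetical. It also makes one judgment call not stated above: a page before 7am is counted against the night that began the previous evening.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_PAGES_PER_WEEK = 5    # healthy is "under 5"
MAX_NIGHTS_PER_MONTH = 2  # healthy is "under 2"


def is_night_page(ts):
    """True for pages between 11pm and 7am."""
    return ts.hour >= 23 or ts.hour < 7


def pages_per_week(pages):
    """Count pages keyed by (primary, ISO year, ISO week)."""
    counts = defaultdict(int)
    for primary, ts in pages:
        iso = ts.isocalendar()
        counts[(primary, iso[0], iso[1])] += 1
    return dict(counts)


def interrupted_nights_per_month(pages):
    """Count distinct interrupted nights keyed by (primary, year, month)."""
    nights = defaultdict(set)
    for primary, ts in pages:
        if not is_night_page(ts):
            continue
        # Assumption: early-morning pages belong to the previous evening's night.
        night = ts.date() if ts.hour >= 23 else ts.date() - timedelta(days=1)
        nights[(primary, night.year, night.month)].add(night)
    return {key: len(dates) for key, dates in nights.items()}


def breaches(counts, limit):
    """Return the keys whose count is at or above the limit."""
    return sorted(key for key, value in counts.items() if value >= limit)
```

Running breaches(pages_per_week(log), MAX_PAGES_PER_WEEK) lists the primary-weeks that are over the line. Several pages in one night count as a single interrupted night, which matches the metric's intent of measuring lost sleep, not alert volume.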
Rotation health check
Once a quarter, pull four numbers: pages per primary per week, interrupted nights per primary per month, percentage of pages that led to a real action, and mean acknowledgement time.
If any of those four is drifting in the wrong direction for two consecutive quarters, the fix is tuning alerts and SLOs, not hiring. Teams that hire their way out of on-call pain almost never reverse the alert drift that caused it.
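The quarterly drift check can be sketched as a comparison over the last three quarterly snapshots. The metric keys and the function name below are hypothetical. The sketch also reads "two consecutive quarters" as two quarter-over-quarter worsenings in a row, which is one plausible interpretation of the rule above.

```python
# -1 means lower is better, +1 means higher is better.
DIRECTION = {
    "pages_per_primary_week": -1,
    "interrupted_nights_per_primary_month": -1,
    "actionable_page_pct": +1,
    "mean_ack_seconds": -1,
}


def drifting_metrics(history):
    """Return metrics that got worse in each of the last two quarters.

    history is a list of per-quarter dicts, oldest first. Fewer than
    three snapshots means there is not enough data to judge drift.
    """
    if len(history) < 3:
        return []
    older, previous, latest = history[-3:]
    flagged = []
    for name, sign in DIRECTION.items():
        worse_then = (previous[name] - older[name]) * sign < 0
        worse_now = (latest[name] - previous[name]) * sign < 0
        if worse_then and worse_now:
            flagged.append(name)
    return flagged
```

A metric that worsens, recovers, and worsens again is not flagged, since the rule targets sustained drift and ignores noise. Anything the function returns is a cue to review alert rules and SLOs for that metric.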
The best signal that a rotation is healthy is boring: engineers volunteer to swap shifts without drama, and nobody feels the need to explain it to HR.