What on-call means and why it exists
On-call is a rotation in which a named engineer owns production response during a defined window. When an alert fires at 2pm or 2am, the on-call engineer is the human accountable for acknowledging it, figuring out what is wrong, mitigating or fixing it, escalating if it is beyond their reach, and handing off cleanly when their shift ends. It is the human layer underneath all your automation: the answer to the question "who actually does something when the graph turns red?"
On-call exists for a structural reason that no amount of good engineering removes: production runs continuously while teams do not. Customers hit your service at 3am in their time zone, batch jobs fail on weekends, certificates expire on holidays, and a dependency three layers down has an outage during your team offsite. Someone has to own the gap between an alert firing and a human acting on it. On-call is the deliberate, scheduled way to own that gap instead of leaving it to whoever happens to notice.
The difference between healthy and toxic on-call is not whether it exists; it is how much it costs the humans who carry it. A well-run rotation has a quiet pager, a clear escalation path, good runbooks, and fair compensation for the inconvenience. A badly run one is a constant stream of non-actionable noise, broken sleep, no backup, and a slow march toward attrition. This guide is about building the first kind and, increasingly, about using AI to remove the pages that make on-call painful in the first place.
It helps to separate on-call from two adjacent ideas. On-call is the schedule and the responsibility; incident management is the process you follow once a page turns into a real incident; and AI incident response is what happens when software, not a human, does the first round of triage and remediation. On-call is the foundation the other two sit on, so getting it right pays off everywhere downstream.
The honest framing: on-call is a tax your team pays so customers get reliability around the clock. Your job as a leader is to make that tax as small, as fair, and as predictable as possible. Every page that did not need a human is the tax being levied for no reason.
Rotation models and team-size math
The rotation model is the shape of the schedule: who is on the hook, when, and with what backup. Four patterns cover almost every team, and most mature teams combine them rather than picking just one.
Primary and secondary
The primary is the first responder; the secondary is the backup who catches what the primary misses. If the primary does not acknowledge a page within the timeout, it rolls to the secondary. This pairing is the single most important reliability improvement most teams can make to on-call, because it removes the catastrophic failure mode where one person is asleep, in a tunnel, or simply overwhelmed and a critical page goes unanswered. The secondary also gives the primary psychological cover: knowing someone has your back makes a hard shift far less stressful.
Weekly versus daily
A weekly rotation gives one engineer a continuous week of coverage. It minimizes handoffs and lets the on-call engineer build context, but a bad week is a long week. A daily rotation spreads the load across more people in shorter bursts, which is gentler per person but multiplies the number of handoffs and dilutes context. A common compromise is weekly primary rotations with a separate, lighter secondary schedule, so the heavy context-carrying job rotates slowly while backup duty rotates faster.
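To make the compromise concrete, here is a minimal sketch of a schedule where the primary rotates weekly and the secondary rotates daily. The function name `build_schedule`, the team names, and the start date are all hypothetical, and the daily backup order is simply round-robin; a real scheduler would also account for leave and fairness across week boundaries.

```python
from datetime import date, timedelta

def build_schedule(engineers, start, days):
    """Return (day, primary, secondary) rows: primary rotates weekly, secondary daily."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    rows = []
    for offset in range(days):
        primary = engineers[(offset // 7) % len(engineers)]
        # Backup duty is drawn from everyone except that day's primary.
        backups = [e for e in engineers if e != primary]
        secondary = backups[offset % len(backups)]
        rows.append((start + timedelta(days=offset), primary, secondary))
    return rows

# Hypothetical four-person team, starting on a Monday.
schedule = build_schedule(["ana", "ben", "chi", "dev"], date(2024, 1, 1), 14)
for day, primary, secondary in schedule[:3]:
    print(day, "primary:", primary, "secondary:", secondary)
```

The primary keeps a full week of context while backup duty changes hands every day, and the same person is never both primary and secondary.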
Follow-the-sun
Follow-the-sun hands the pager between geographic regions so each engineer only ever covers daytime hours. A team split across, say, North America, Europe, and Asia can pass on-call westward around the clock and nobody is ever woken at night. This is the gold standard for humane on-call, but it requires real engineering presence in at least two and ideally three time zones, plus disciplined handoffs at each boundary. If you have the geographic footprint, follow-the-sun is worth the coordination cost; if you do not, do not fake it by labeling a single-region team as follow-the-sun.
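One way to keep a follow-the-sun schedule honest is to check that the regional windows really cover all 24 hours. The sketch below assumes three hypothetical regions with eight-hour UTC windows; the region names, the hours, and the function names are illustrative, not a recommendation for where your boundaries should fall.

```python
from datetime import datetime, timezone

# Hypothetical regions and the UTC hours [start, end) each one covers.
REGIONS = [
    ("asia", 0, 8),
    ("europe", 8, 16),
    ("north_america", 16, 24),
]

def uncovered_hours(regions=REGIONS):
    """Return the UTC hours that no region covers (empty list means full coverage)."""
    covered = {hour for _, start, end in regions for hour in range(start, end)}
    return sorted(set(range(24)) - covered)

def region_on_call(now_utc, regions=REGIONS):
    """Return the region holding the pager; now_utc must already be in UTC."""
    for name, start, end in regions:
        if start <= now_utc.hour < end:
            return name
    raise LookupError(f"no region covers {now_utc.hour}:00 UTC")

print(uncovered_hours())
print(region_on_call(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)))
```

Running `uncovered_hours` against a single-region list makes the "do not fake it" point visible: most of the day comes back uncovered.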
The team-size math
The most important number in on-call design is how often a given person is on the pager. A widely used target is at least six to eight engineers per rotation, so any individual is on-call no more than one week in six to eight. That ratio keeps the burden tolerable and leaves slack for vacation, illness, and parental leave without the schedule collapsing.
- Eight or more engineers: comfortable. One week in eight or better, with room for a secondary tier and follow-the-sun if the geography allows.
- Six to seven engineers: workable, but watch for the schedule tightening every time someone takes leave.
- Four to five engineers: strained. Someone is always on-call, just off it, or about to go on. Treat this as a temporary state, not a steady one.
- Fewer than four engineers: a retention risk. The right move is not to grind a small team harder but to cut page volume at the source with automation and AI, then add humans deliberately.
The trap teams fall into is treating on-call as a fixed cost and trying to spread it across whatever headcount they have. The better mental model is that page volume is the variable you control. If the math says you need eight engineers but you have five, the answer is usually to make the pager quieter, not to make five people miserable. That is exactly where the AI section below comes in.
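The arithmetic behind those bands is simple enough to sketch. The function names below are hypothetical, and the calculation assumes a single weekly primary tier and a 52-week year; it is a rough planning aid, not a staffing formula.

```python
def classify_rotation(engineers):
    """Map headcount to the bands described above."""
    if engineers >= 8:
        return "comfortable"
    if engineers >= 6:
        return "workable"
    if engineers >= 4:
        return "strained"
    return "retention risk"

def primary_weeks_per_year(engineers, on_leave=0):
    """Weeks per year each available engineer carries the primary pager."""
    available = engineers - on_leave
    if available < 1:
        raise ValueError("nobody is available to carry the pager")
    return 52 / available

for size in (8, 6, 5, 3):
    print(size, classify_rotation(size), round(primary_weeks_per_year(size), 1))
```

Passing `on_leave=1` for a six-person team shows how quickly a workable rotation tightens: each remaining engineer goes from roughly 8.7 to 10.4 primary weeks a year.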
Escalation policies that never drop a page
An escalation policy is the rule set that decides who gets paged, in what order, and how long the system waits before moving on. It is the safety net that guarantees no alert can vanish into the void because one person did not see it. A good escalation policy is boring, explicit, and impossible to fall through.
Acknowledgment timeouts
Every escalation step needs a timeout: the window in which the current person must acknowledge before the page rolls onward. A common pattern is a short first timeout, around five minutes for high-severity pages, so a sleeping or unavailable primary does not delay response for long. Lower-severity pages can use longer timeouts. The number is a tradeoff: too short and you wake the secondary needlessly; too long and a real incident festers while the primary is unreachable.
Tiers and severity
Escalation should be tiered by severity, not one-size-fits-all. A SEV-1 customer-facing outage might page primary, then secondary after five minutes, then the team lead, then a manager and an incident commander, with a war room opening automatically. A SEV-3 degraded-but-working issue might page only the primary during business hours and simply create a ticket overnight. Matching the urgency of the escalation to the urgency of the problem is how you protect sleep without letting real incidents go unanswered.
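The two examples above can be expressed as a small routing function. This is a sketch under stated assumptions: business hours of 09:00 to 17:00, the action names, and the choice to page the primary for any severity the text does not cover are all hypothetical.

```python
from datetime import time

BUSINESS_START, BUSINESS_END = time(9, 0), time(17, 0)  # assumed business hours

def route(severity, local_time):
    """Decide what a new alert should trigger, based on severity and time of day."""
    in_hours = BUSINESS_START <= local_time < BUSINESS_END
    if severity == "SEV-1":
        return "page_full_chain"       # always pages, day or night
    if severity == "SEV-3":
        return "page_primary" if in_hours else "create_ticket"
    # Assumption: anything unclassified pages the primary rather than waiting.
    return "page_primary"

print(route("SEV-1", time(3, 0)))
print(route("SEV-3", time(3, 0)))
print(route("SEV-3", time(10, 30)))
```

Defaulting unknown severities to a page is a deliberately cautious choice; a team confident in its classification might default to a ticket instead.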
Fallbacks: the guarantee
Every escalation chain must end in a guaranteed fallback so no page can fall through the cracks. If the primary, secondary, and lead all fail to acknowledge, the page must reach someone, whether a wider team channel, an always-staffed operations center, or a manager phone tree. The single worst on-call failure is a critical alert that paged into silence because the chain ran out of people. Designing the last link to be unmissable is non-negotiable.
| Step | Who | Typical timeout | Purpose |
|---|---|---|---|
| 1 | Primary on-call | 5 min | First responder for the service |
| 2 | Secondary on-call | 5 min | Backup if primary misses the page |
| 3 | Team lead | 10 min | Pull in ownership and authority |
| 4 | Manager plus incident commander | immediate | Coordinate a major incident |
| 5 | Wide fallback channel | guaranteed | The unmissable last link |
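The table can be read as data plus one rule: keep paging the next step until someone acknowledges. The sketch below encodes that rule. The step names are shorthand for the table rows, and the five-minute wait on step 4 is an assumption, since the table gives no number there; `validate` and `paged_so_far` are hypothetical helper names.

```python
# (who, minutes to wait for an acknowledgment); None marks the final, guaranteed step.
CHAIN = [
    ("primary", 5),
    ("secondary", 5),
    ("team_lead", 10),
    ("manager_and_incident_commander", 5),  # assumed wait; the table gives no number
    ("fallback_channel", None),
]

def validate(chain):
    """Reject any chain that could run out of people before reaching a fallback."""
    if not chain or chain[-1][1] is not None:
        raise ValueError("the chain must end in a fallback with no timeout")
    if any(timeout is None for _, timeout in chain[:-1]):
        raise ValueError("only the last step may lack a timeout")

def paged_so_far(minutes_unacked, chain=CHAIN):
    """Return everyone paged after this many minutes without an acknowledgment."""
    validate(chain)
    paged, elapsed = [], 0
    for who, timeout in chain:
        paged.append(who)
        if timeout is None or minutes_unacked < elapsed + timeout:
            break
        elapsed += timeout
    return paged

for minutes in (0, 5, 12, 25):
    print(minutes, paged_so_far(minutes))
```

The `validate` check is the important part: a chain whose last step has a timeout can still page into silence, so the sketch refuses to evaluate it.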
Escalation policies are also where AI earns its place early: if software acknowledges and resolves routine pages within seconds, the human escalation chain only ever fires for the incidents that genuinely need a person, which is exactly what it was designed for.
Humane on-call: compensation and burnout
On-call is one of the most common causes of burnout and attrition in reliability roles, and it is almost always preventable. Humane on-call rests on four protections, and skipping any one of them is how good engineers quietly decide to leave.
Compensate it explicitly
On-call is work, and work outside normal hours deserves explicit recognition. That can mean on-call pay, time off in lieu of the hours carried, or a reduced delivery load during and after a rotation. The exact mechanism matters less than the principle: pretending on-call is free is how teams end up resented and depleted. When people feel fairly compensated for the inconvenience, they tolerate it; when they feel it is an invisible tax, they burn out.
Cap page volume and protect sleep
The fastest path to attrition is broken sleep. A rotation that routinely wakes someone several times a night is not a tough assignment; it is a broken system. Set an explicit ceiling on after-hours pages and treat any night that exceeds it as a defect to be fixed in the next sprint, the same way you would treat a recurring production bug. Repeated night pages for the same cause should generate a fix, never just another page. This is the discipline that separates teams that survive on-call from teams that are slowly ground down by it.
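A ceiling only works if something checks it. As a minimal sketch, the code below groups page timestamps by night and reports the nights that exceeded the limit. The night window of 22:00 to 07:00, the ceiling of two pages, and the sample timestamps are all assumptions to replace with your own.

```python
from collections import Counter
from datetime import datetime, timedelta

NIGHT_START, NIGHT_END = 22, 7  # assumed night window, local time
CEILING = 2                     # hypothetical limit on pages per night

def night_of(ts):
    """Return the date a night started on, or None if the page fired in the daytime."""
    if ts.hour >= NIGHT_START:
        return ts.date()
    if ts.hour < NIGHT_END:
        return (ts - timedelta(days=1)).date()
    return None

def breached_nights(page_times, ceiling=CEILING):
    """Return the nights whose page count exceeded the ceiling."""
    counts = Counter(night for night in map(night_of, page_times) if night is not None)
    return sorted(night for night, n in counts.items() if n > ceiling)

pages = [
    datetime(2024, 3, 1, 23, 10),
    datetime(2024, 3, 2, 1, 40),
    datetime(2024, 3, 2, 3, 5),
    datetime(2024, 3, 2, 14, 0),   # daytime, ignored
    datetime(2024, 3, 2, 23, 30),
]
breaches = breached_nights(pages)
print(breaches)
```

Each date in the result is a candidate defect ticket for the next sprint.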
Protect recovery
After a hard shift, an engineer needs time to recover, not a full sprint of feature work due the next morning. Build recovery into the schedule: lighter delivery expectations during on-call weeks, and explicit permission to rest after a brutal night. A team that punishes a rough on-call week with an unchanged delivery deadline is teaching its people that reliability work is something to avoid.
Kill the noise at the source
The deepest fix for on-call burnout is also the simplest to state: reduce the number of pages. Most of the pain comes from repetitive, non-actionable, or auto-resolvable alerts that never needed a human at all. Every one of those you eliminate, through better alerting, runbook automation, or AI auto-resolution, is a page that will not interrupt dinner or sleep. The humane endgame, covered below, is a pager that only ever fires for genuine judgment calls.
Attrition math: replacing one senior reliability engineer who quits over on-call costs many months of recruiting, ramp-up, and lost context. Almost any investment that meaningfully reduces page volume pays for itself the first time it prevents a resignation.
Runbooks and clean shift handoffs
Two practices do more to make on-call survivable than any tool: high-quality runbooks and disciplined shift handoffs. Both attack the same enemy, dependence on a single person's memory.
Runbook quality
A runbook turns tribal knowledge into a step-by-step procedure any on-call engineer can follow at 3am without paging an expert. The test of a good runbook is simple: could the newest member of the rotation, half-asleep, follow it to a safe resolution? That means concrete commands, expected outputs, decision points spelled out, and a clear "if this does not work, escalate to X" exit. Vague runbooks that assume deep context are worse than none, because they create false confidence. Runbooks should be living documents, updated after every incident that revealed a gap.
Runbook quality is also the precondition for AI auto-resolution. An automated remediation can only execute a procedure that has been made explicit. The discipline of writing a clear runbook for your top incident types is the same discipline that lets software run those runbooks for you, which is the bridge from humane on-call to mostly-automated on-call.
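One way to enforce that explicitness is to store runbooks as structured data and lint them for the gaps described above. The sketch below is illustrative only: the `Step` and `Runbook` shapes, the `lint` helper, and the disk-full example are hypothetical, not a standard runbook format.

```python
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str
    command: str    # the concrete command to run
    expected: str   # what the responder should see if it worked

@dataclass
class Runbook:
    title: str
    steps: list
    escalate_to: str  # who to page if the steps do not resolve it

def lint(runbook):
    """Return the gaps that would leave a half-asleep responder guessing."""
    problems = []
    if not runbook.steps:
        problems.append("no steps")
    for number, step in enumerate(runbook.steps, start=1):
        if not step.command.strip():
            problems.append(f"step {number}: no concrete command")
        if not step.expected.strip():
            problems.append(f"step {number}: no expected output")
    if not runbook.escalate_to.strip():
        problems.append("no escalation exit")
    return problems

draft = Runbook(
    title="Disk full on a log volume",
    steps=[
        Step("Check usage", "df -h /var/log", "Use% column shows the volume above 90%"),
        Step("Remove old archives", "", ""),
    ],
    escalate_to="",
)
print(lint(draft))
```

A runbook that passes this kind of check is not guaranteed to be good, but one that fails it is certainly not ready for a human at 3am or for automation.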
The shift handoff
The handoff is the moment the outgoing engineer transfers context to the incoming one: open incidents, flaky services, in-flight changes, and anything that is "watch out for this." A skipped or sloppy handoff is how an incident silently regresses between shifts, because the new on-call has no idea a fragile fix is holding something together. A good handoff is a short, structured ritual: a written summary of the state of the world plus a quick verbal sync for anything sensitive. It costs ten minutes and prevents the class of incident where nobody realized a problem had been handed to them.
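The written half of that ritual can be as simple as a fixed template, so nothing is skipped when the outgoing engineer is tired. A minimal sketch, assuming the four categories named above; the `Handoff` class, its field names, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    outgoing: str
    incoming: str
    open_incidents: list = field(default_factory=list)
    flaky_services: list = field(default_factory=list)
    in_flight_changes: list = field(default_factory=list)
    watch_out: list = field(default_factory=list)

    def render(self):
        """Produce the written summary, printing 'none' for empty sections."""
        sections = [
            ("Open incidents", self.open_incidents),
            ("Flaky services", self.flaky_services),
            ("In-flight changes", self.in_flight_changes),
            ("Watch out for", self.watch_out),
        ]
        lines = [f"Handoff: {self.outgoing} -> {self.incoming}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        return "\n".join(lines)

note = Handoff(
    outgoing="ana",
    incoming="ben",
    open_incidents=["Queue backlog on the billing worker, mitigated by a manual restart"],
    watch_out=["The restart is a temporary fix; the backlog may return"],
)
print(note.render())
```

Printing every section, even when empty, forces an explicit "none" rather than a silent omission.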
Reducing tribal knowledge
The strategic goal behind both practices is to stop reliability from depending on heroes. When the only person who knows how to fix the payments service is one senior engineer, that person can never truly be off-call, and the team is one resignation away from a crisis. Runbooks, recorded handoffs, and AI that has ingested both are how you spread that knowledge across the system so no single human is a single point of failure. This connects directly to root cause analysis: every postmortem that produces a runbook entry is tribal knowledge being converted into shared, durable, and ultimately automatable procedure.
On-call metrics worth tracking
You cannot improve an on-call rotation you do not measure, and most teams measure the wrong things. Track these five, and be honest about what they tell you.
1 Page volume per rotation
How many pages a single on-call shift receives. The headline number for sustainability. If it is climbing, your on-call is getting harder regardless of how good your people are.
2 Mean time to acknowledge
How long from a page firing to a human acknowledging it. A rising MTTA hints at fatigue, unreachable responders, or an escalation policy that needs tuning.
3 After-hours pages
The share of pages that fire outside business hours. This is the sleep-and-burnout number. Driving it toward zero is the single biggest humane-on-call win.
4 Toil ratio
How much on-call time is spent on repetitive, manual, automatable work versus genuine problem-solving. High toil is a backlog of automation waiting to be built.
5 Non-actionable rate
The percentage of pages that required no real human action. Every non-actionable page is alert fatigue in numeric form and a candidate for deletion or automation.
What to ignore
Skip vanity metrics like total alerts processed. They measure activity, not whether humans are being woken for things that actually need a human.
Read these together, not in isolation. A low MTTA with a high non-actionable rate means your people are heroically fast at handling pages that should never have fired. The destination is a low page volume, a near-zero after-hours rate, and a low toil ratio: on-call that protects reliability without consuming the people who carry it. For the deeper treatment of which reliability numbers matter, see our guide to incident management metrics.
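All five numbers can be derived from one page log. The sketch below assumes a hypothetical `Page` record with fired and acknowledged timestamps, an actionable flag, and time spent; it also assumes business hours of 09:00 to 17:00 on weekdays. Your pager tool's export will look different, so treat the field names as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    fired: datetime
    acked: datetime
    actionable: bool     # did it need real human action?
    toil_minutes: int    # time spent on repetitive, automatable work
    total_minutes: int   # total time spent on the page

def is_after_hours(ts, start=9, end=17):
    """Assumed business hours: weekdays, 09:00 to 17:00 local time."""
    return ts.weekday() >= 5 or not (start <= ts.hour < end)

def summarize(pages):
    """Compute the five on-call metrics for one rotation's pages."""
    n = len(pages)
    if n == 0:
        raise ValueError("no pages to summarize")
    total = sum(p.total_minutes for p in pages)
    return {
        "page_volume": n,
        "mtta_minutes": sum((p.acked - p.fired).total_seconds() for p in pages) / n / 60,
        "after_hours_share": sum(is_after_hours(p.fired) for p in pages) / n,
        "toil_ratio": sum(p.toil_minutes for p in pages) / total if total else 0.0,
        "non_actionable_rate": sum(not p.actionable for p in pages) / n,
    }

log = [
    Page(datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 10, 2), True, 10, 40),
    Page(datetime(2024, 3, 5, 2, 0), datetime(2024, 3, 5, 2, 6), False, 15, 15),
    Page(datetime(2024, 3, 5, 3, 0), datetime(2024, 3, 5, 3, 4), False, 5, 5),
    Page(datetime(2024, 3, 9, 12, 0), datetime(2024, 3, 9, 12, 4), True, 0, 20),
]
metrics = summarize(log)
print(metrics)
```

In this made-up log the MTTA looks healthy at four minutes, yet half the pages were non-actionable and three of four fired after hours, which is exactly the pattern that reading the metrics together is meant to expose.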
How AI shrinks the on-call burden
Every protection above helps, but they all manage the symptoms of too many pages. AI attacks the cause by removing routine pages entirely, so humans are paged only for genuine judgment calls. This is the shift that changes on-call from a tax to be minimized into a rare, high-value responsibility. It works in three layers.
Auto-triage: one incident instead of twenty pages
When something breaks, it rarely fires a single clean alert. A database slowdown trips latency alarms on every dependent service, and the on-call engineer drowns in a storm of related pages. AI auto-triage correlates those signals across the stack into one enriched incident, suppresses the duplicates, and presents the human, if a human is even needed, with a single coherent picture instead of twenty fragments. The page count drops immediately, and the remaining pages are far easier to act on.
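A much-simplified version of that correlation groups alerts by a shared upstream dependency within a time window. The sketch below is a plain rule-based stand-in, not how any particular AI platform does it; the dependency map, the five-minute window, and the service names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    at: float  # seconds since the start of the observation period

# Hypothetical dependency map: service -> the upstream service it relies on.
DEPENDS_ON = {"checkout": "orders-db", "search": "orders-db", "orders-db": None, "email": None}

def root_of(service):
    """Walk upstream to the deepest known dependency, guarding against cycles."""
    seen = set()
    while DEPENDS_ON.get(service) and service not in seen:
        seen.add(service)
        service = DEPENDS_ON[service]
    return service

def correlate(alerts, window=300):
    """Fold alerts that share a root and arrive within `window` seconds into one incident."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.at):
        root = root_of(alert.service)
        for incident in incidents:
            if incident["root"] == root and alert.at - incident["last"] <= window:
                incident["alerts"].append(alert)
                incident["last"] = alert.at
                break
        else:
            incidents.append({"root": root, "last": alert.at, "alerts": [alert]})
    return incidents

storm = [Alert("checkout", 0), Alert("search", 30), Alert("orders-db", 45),
         Alert("email", 60), Alert("checkout", 1000)]
incidents = correlate(storm)
for incident in incidents:
    print(incident["root"], len(incident["alerts"]))
```

Five alerts become three incidents here: the database slowdown and its two dependents collapse into one, the unrelated email alert stands alone, and the late checkout alert opens a fresh incident because it falls outside the window.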
Auto-resolution: the page that never reaches a human
For routine, well-understood incidents, AI can execute the runbook directly within a policy envelope: restart the stuck worker, scale the saturated pool, fail over the unhealthy node, roll back the bad deploy. When software resolves these, the human is never paged at all. This is the heart of the model: a disk-full alert at 3am that an agent clears in seconds is a page that simply does not happen. Agentic platforms like Nova's agentic SRE own this loop end to end, acting within explicit guardrails and recording every action in an audit ledger so the autonomy is always accountable.
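To illustrate what a policy envelope and audit trail might look like in the simplest case, the sketch below allows a short list of actions, rate-limits each one, and records every decision. It is a generic illustration, not Nova's implementation; the `ENVELOPE` structure, the limits, and the `attempt` function are hypothetical, and the actual remediation is passed in as a callable.

```python
from datetime import datetime, timezone

# Hypothetical envelope: actions an agent may take unattended, with hourly limits.
ENVELOPE = {
    "restart_worker": {"max_per_hour": 3},
    "scale_pool": {"max_per_hour": 2},
}
audit_log = []

def attempt(action, target, recent_count, execute):
    """Run the action only if the envelope allows it; record the decision either way."""
    rule = ENVELOPE.get(action)
    if rule is None:
        decision = "page_human: action not in envelope"
    elif recent_count >= rule["max_per_hour"]:
        decision = "page_human: rate limit reached"
    else:
        execute(target)
        decision = "auto_resolved"
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "decision": decision,
    })
    return decision

restarted = []
print(attempt("restart_worker", "worker-7", 0, restarted.append))
print(attempt("restart_worker", "worker-7", 3, restarted.append))
print(attempt("drop_table", "orders", 0, restarted.append))
```

The rate limit matters as much as the allowlist: an agent restarting the same worker for the fourth time in an hour is masking a problem that deserves a human and a permanent fix.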
Better pages for the incidents that remain
Some incidents genuinely need a human: novel failures, ambiguous tradeoffs, decisions with business context AI should not make alone. For those, AI makes the page better. Instead of a bare "latency high" alert, the on-call engineer receives the likely cause, the relevant logs already pulled, the timeline of what changed, and a suggested remediation to approve or reject. The human is doing the judgment they are uniquely good at, not the grunt work of gathering context at 3am.
Nova's position: the right number of routine pages is zero. Nova's agents detect, triage, and auto-resolve the well-understood incidents across AWS, GCP, Azure, Linux, and Windows, so your on-call rotation exists for the rare genuine judgment call, not the nightly parade of restartable services. The pager gets quiet, and quiet pagers retain engineers.
The compounding effect is what matters. Fewer pages means a smaller team can sustain on-call, which eases the team-size math. A quieter pager means better sleep, which cuts burnout and attrition. And every incident an agent resolves and logs becomes training for the next one, so the system gets quieter over time rather than louder. To see how this fits the broader move from human-first to AI-first operations, read our guide to AI SRE.
A 90-day on-call overhaul plan
You do not fix on-call in a weekend, but 90 days is enough for a real turnaround. The sequence matters: measure first, cut noise second, automate third. Trying to automate before you understand your baseline just builds fast machinery for the wrong problem.
Days 1 to 30: measure honestly
Instrument the five metrics above and establish your real baseline. Most teams are shocked by their after-hours page count and non-actionable rate once they actually look. Audit your escalation policy for gaps, especially a missing guaranteed fallback. Survey the rotation about what actually hurts. You cannot fix what you have not measured, and the act of measuring usually surfaces the two or three alert sources causing most of the pain.
Days 31 to 60: cut the noise
Attack the worst offenders. Tune or delete the alerts driving your non-actionable rate. Write or rewrite runbooks for your top incident types so any on-call engineer can resolve them without paging an expert. Fix the escalation policy: add a secondary tier if you lack one, set sane acknowledgment timeouts, and guarantee the fallback. Formalize the shift handoff as a written ritual. None of this is glamorous, and all of it directly lowers the page count humans see.
Days 61 to 90: automate and make it humane
Now that your runbooks are explicit and your alerts are honest, wire in AI auto-triage so alert storms collapse into single incidents, and AI auto-resolution so routine pages stop reaching humans entirely, within a policy envelope you control. In parallel, formalize humane on-call: explicit compensation, a hard cap on after-hours pages treated as a defect when exceeded, and protected recovery time. By day 90 the pager is meaningfully quieter, the rotation is fairer, and the remaining pages are the ones that actually deserve a human.
- Instrument page volume, mean time to acknowledge, after-hours pages, toil ratio, and the non-actionable rate, and publish the baseline to the team.
- Audit the escalation policy and guarantee an unmissable fallback so no page can ever drop into silence.
- Add a primary-and-secondary tier if you do not have one, with sane per-severity acknowledgment timeouts.
- Identify the top three alert sources by volume and either fix the root cause, tune the threshold, or delete them.
- Write or rewrite a clear, runnable runbook for each of your top incident types, written for a half-asleep newcomer.
- Formalize the shift handoff as a short written ritual covering open incidents, flaky services, and in-flight changes.
- Set an explicit ceiling on after-hours pages and commit to treating any breach as a defect in the next sprint.
- Establish explicit on-call compensation and protected recovery time, and communicate both clearly.
- Deploy AI auto-triage so correlated alerts collapse into single enriched incidents instead of page storms.
- Enable AI auto-resolution for routine incidents within a policy envelope so those pages never reach a human.
Work the list in order and the relief arrives early: noise reduction in the first month buys goodwill, and automation in the third makes the gains durable. The endpoint is an on-call rotation that a six-person team can sustain comfortably, where the pager is quiet enough to be almost forgettable until the rare moment it genuinely needs a human. For a deeper look at the people doing this work, see our guide to the modern AI engineer and how AI observability feeds the signals these systems act on.
Related guides
On-call sits at the center of the reliability cluster. These guides go deeper on the practices and platforms that make a quiet pager possible:
- Incident Management: the process you follow once a page becomes a real incident.
- AI Incident Response: how software handles the first round of triage and remediation.
- Self-Healing Infrastructure: systems that remediate routine failures before a human is paged.
- Root Cause Analysis: turning every incident into a runbook entry and a permanent fix.
- Agentic SRE: specialized AI agents owning the full operational loop within a policy envelope.
- AI SRE: the broad shift from human-first to AI-first reliability operations.
- AIOps: the signal-correlation and anomaly-detection layer that feeds auto-triage.
- AI Observability: the telemetry that lets agents understand what is actually happening.
- LLMOps: operating LLM apps in production, for teams whose on-call now covers AI systems.
Make the pager quiet again
Nova's AI agents detect, triage, and auto-resolve routine incidents across your stack, so your on-call rotation exists for the rare genuine judgment call, not the nightly noise. Quiet pagers retain engineers.