Escalation Policies That Don't Drop Incidents
Most escalation policies fail in the same place: when the primary doesn't ack, the policy assumes someone else will pick it up. Without a second name actually attached to the page, no one shows up.
How incidents get dropped
The primary on-call's phone is on do-not-disturb. The secondary's name is "TBD". The manager's pager isn't on the rotation. The page goes nowhere; the incident accumulates damage; someone notices three hours later in Slack. By then a 15-minute incident has become a half-day customer-trust problem.
Every team thinks "this won't happen to us" until it does. The dropped incident is almost always preceded by a silent failure mode that the team didn't know existed: a vacation override that wasn't cleared when the engineer returned, a phone-number change that wasn't pushed to the alerting tool, a Slack notification setting on the on-call channel that defaulted to "don't disturb." Each is a five-minute fix, invisible until it's the cause of a major outage.
The mental model that helps: assume your escalation policy will fail in some way you haven't predicted. Design for failure detection, not perfect operation. Quarterly tested escalations (the 2am test page) are the single highest-leverage practice for catching silent failures.
The three-step pattern
- Primary on-call paged. 5 minutes to ack.
- Secondary on-call paged. 5 more minutes.
- Engineering manager paged. From here it climbs the management chain, one level every 10 minutes.
Three steps is enough; more steps blur the urgency. A five-step escalation might feel safer ("we have backup for the backup") but in practice each additional step lengthens the worst-case time-to-human and creates more places for silent failures to live. Three steps with rigorous testing beats five steps with hope.
The reason for the role names (primary, secondary, manager) rather than person names is durability. People go on vacation, change roles, leave the company. Roles persist. The escalation chain that says "page the EM of the affected team's parent group" continues to work after every reorg; the chain that says "page Anand" stops working the moment Anand changes teams.
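As a concrete illustration, here is a minimal sketch of the chain expressed as role-based data rather than person names. The role strings and the `rotation` mapping are hypothetical, not any vendor's schema:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    role: str         # a role, never a person: survives vacations and reorgs
    timeout_min: int  # minutes to acknowledge before the chain moves on

# The three-step chain above, as data. Role names are illustrative.
CHAIN = [
    EscalationStep(role="primary_oncall", timeout_min=5),
    EscalationStep(role="secondary_oncall", timeout_min=5),
    EscalationStep(role="engineering_manager", timeout_min=10),
]

def resolve(role: str, rotation: dict[str, str]) -> str:
    """Map a role to whoever holds it today. The rotation is the only
    place names live, so a reorg changes one mapping, not the chain."""
    return rotation[role]
```

The chain itself never mentions a person; only the rotation mapping does, and it changes in exactly one place.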
Timing
5 minutes for primary, 5 for secondary, 10 for manager. Tighter is hostile; looser is dropped. The 5-minute window is short enough that an awake person will respond and long enough that a deeply asleep person can wake up to the second buzz.
The 10-minute manager step looks longer but reflects reality. By the time the chain reaches a manager, both on-callers have failed to ack, which is a serious enough signal that the manager will treat it with full focus. Padding the time prevents the manager from being roused for what would have been a fast-resolve had the second on-caller answered 30 seconds later.
What about SEV1 versus SEV3? Use the same escalation timing for all severities. Different timings per severity create complexity that backfires under stress; the on-caller has to remember which version of the policy applies, and at 3am they'll guess wrong. Same chain, same intervals; severity affects who gets paged in addition to the on-call, not how the chain runs.
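One way to picture that rule (the severity labels and extra roles here are hypothetical): severity selects who gets paged in parallel, and nothing else.

```python
# Severity changes who is paged in parallel, never the chain or its intervals.
EXTRA_PAGES_BY_SEVERITY = {
    "SEV1": ["incident_commander", "comms_lead"],  # roles are illustrative
    "SEV2": ["incident_commander"],
    "SEV3": [],
}

def extra_pages_for(severity: str) -> list[str]:
    # The escalation chain itself is untouched; these roles are paged alongside it.
    return EXTRA_PAGES_BY_SEVERITY.get(severity, [])
```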
Silent failure modes
The most common: someone in the rotation left the team and was never removed. A vacation override that ended without the regular schedule being restored. Notification preferences that silently swallow pages (a phone in DND mode that the alerting tool doesn't bypass). Each one bites when the team is under stress and least able to debug it.
Other common silent failures. (1) The on-call schedule has a gap because no one picked up the shift after theirs. PagerDuty/Opsgenie show this as "no one assigned" if you look, but the team didn't look. (2) The phone-tree integration broke after a vendor update; pages still go out, just slower. (3) The escalation policy refers to a Slack channel that was archived. The page lands in /dev/null. (4) International team members have phone numbers that the alerting provider doesn't reach reliably.
The audit cadence. Quarterly, walk through every escalation policy: who's on it, are their contact details current, is the rotation schedule covered for the next 90 days, are there gaps. The 30-minute exercise eliminates 80% of silent failures before they bite.
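A sketch of what that audit can look like as a script. The `policy` object and its attribute names are placeholders for whatever your alerting tool's API actually returns, not a real vendor interface:

```python
from datetime import date, timedelta

def audit_policy(policy, today: date | None = None) -> list[str]:
    """Walk one escalation policy: stale members, stale contact details,
    and rotation gaps over the next 90 days. Attribute names are hypothetical."""
    today = today or date.today()
    findings = []
    for member in policy.members:
        if not member.on_team:           # left the team, never removed
            findings.append(f"stale member: {member.name}")
        if not member.contact_verified:  # number changed, never pushed
            findings.append(f"unverified contact: {member.name}")
    for offset in range(90):
        day = today + timedelta(days=offset)
        if policy.schedule.assignee_on(day) is None:
            findings.append(f"rotation gap on {day.isoformat()}")
    return findings
```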
Testing the policy
Once a quarter, send a real test page at 2am. Run it through the normal chain exactly like a live incident; it escalates past the chain only if no one acknowledges within the budgeted total (20 minutes for the timings above). The exercise reveals every silent failure mode in one round.
How to do this without burning out the team. Pick a quiet weeknight (Tuesday is traditional). Inform the team in advance that a test page is coming THIS QUARTER but not which night. The on-call sees the page, acks it, and posts "test acknowledged" in the channel. Total impact: three minutes of sleep disruption per quarter for one engineer.
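If you want the night itself to be genuinely unpredictable, picking it can be a few lines of script (a sketch, keeping the 2am-Tuesday convention from above):

```python
import random
from datetime import datetime, timedelta

def pick_test_night(quarter_start: datetime) -> datetime:
    """Choose a random Tuesday 2am within the quarter; announce the
    quarter to the team, never the night."""
    tuesdays = [
        (quarter_start + timedelta(days=n)).replace(hour=2, minute=0,
                                                    second=0, microsecond=0)
        for n in range(90)
        if (quarter_start + timedelta(days=n)).weekday() == 1  # Tuesday
    ]
    return random.choice(tuesdays)
```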
What you measure: time-to-ack for primary, secondary, and manager. If the primary acks in 3 minutes, healthy. If the primary doesn't ack and the secondary does at minute 6, the chain works. If neither acks and the manager does at minute 11, your primary/secondary rotation has a problem. If nobody acks within 20 minutes, you have a serious silent failure to fix.
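Those thresholds are mechanical enough to encode, which keeps the post-test grading consistent from quarter to quarter. A sketch using the intervals above:

```python
def grade_test_page(ack_by: str | None, ack_minute: float) -> str:
    """Classify a quarterly test page by who acknowledged and when."""
    if ack_by == "primary" and ack_minute <= 5:
        return "healthy"
    if ack_by == "secondary" and ack_minute <= 10:
        return "chain works; ask the primary what went wrong"
    if ack_by == "manager" and ack_minute <= 20:
        return "primary/secondary rotation has a problem"
    return "serious silent failure: fix before the next real incident"
```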
The tooling layer
Most teams use PagerDuty, Opsgenie, Incident.io, or a similar service. The mistake to avoid: writing your escalation logic in the alerting service AND in a custom Slack bot AND in a CI hook. Three sources of truth means at least one is wrong.
Pick one tool as the source of truth and route everything through it. The other systems consult it; they don't duplicate the logic. The day someone changes the rotation, they change it in one place and every downstream system reflects the change automatically.
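What "consult, don't duplicate" looks like in practice: downstream systems ask the source of truth at page time. A sketch with an illustrative HTTP endpoint, not any real vendor's API:

```python
import os
import requests

def current_oncall(schedule_id: str) -> str:
    """Ask the alerting tool (the single source of truth) who is on call
    right now. The Slack bot calls this; it never re-implements the chain."""
    resp = requests.get(
        f"https://alerting.example.com/api/schedules/{schedule_id}/oncall",
        headers={"Authorization": f"Bearer {os.environ['ALERTING_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["user"]["name"]
```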
What to put in the on-call's "first action" doc, regardless of tooling: the link to the production status dashboard, the link to the runbooks for the affected service, the Slack channel in which to declare an incident, and the escalation policy in human-readable form (so the on-caller knows who to summon if the situation needs it). Three links plus a paragraph, pinned to the on-call channel.
Antipatterns
"Page everyone all at once." Tempting at first ("more eyes is better!"); guaranteed to produce three engineers debugging the same symptom in parallel and zero coordination. The escalation chain is sequential by design.
"The CTO has the pager." Common in early-stage startups. Works for a quarter; produces an exhausted CTO and a team that hasn't built the on-call muscle. Move to rotating on-call by 8-10 engineers.
"Different escalation chains for different services and different severities." Looks responsive; produces a 47-row policy table that nobody can navigate at 3am. One chain per service, the same shape across services if possible. Severity affects who's added in parallel, not the chain itself.
"The manager is on the rotation as primary." Managers are bad on-callers because they're in meetings during the day. They should be in the escalation chain (manager step at minute 10) but not on the primary rotation.
What to do this week
Three moves. (1) Print your current escalation policy. Walk a teammate through it; if they get confused, the policy is too complex. (2) Audit phone numbers and contact methods for every person on every rotation. About 10% will be stale on the first audit. (3) Schedule a 2am test page for next quarter. Put it on the engineering calendar with a note: "tests will happen sometime this quarter, don't ignore your pager." The act of scheduling raises team awareness even before the test fires.