Designing a Paging Policy That Does Not Burn Out the Team
What pages, when, and to whom is a policy decision that most teams make by accident. Make it on purpose.
The default policy is awful
Every alert pages the on-call. Pretty soon nobody trusts the pager: half the pages are noise, and the half that matter get the same response time as the noise.
The pathology's mechanism. Without a tiered paging policy, the alerting system treats every alert as critical; the on-call experiences this as constant interruption. Their reaction is to mentally downgrade pages ("probably another flap"), which means genuinely critical pages get the slow response that is appropriate for noise.
The cost of the default. Real customer-facing incidents get a 15-minute response when they could have had a 3-minute response. Those 12 minutes of delay show up as longer outages, more affected customers, and harder postmortems.
Three tiers
P1 wakes someone up. P2 reaches the on-call during business hours. P3 lands in a queue for the next morning. Three is enough; more tiers blur the boundaries; fewer tiers force everything into "page now" or "ignore."
The simplicity argument. Two-tier policies (page or queue) lose middle-ground granularity; some alerts are urgent but not 3am-urgent. Four-tier policies (P1/P2/P3/P4) blur the lines; what's the difference between P3 and P4? Three tiers fits in working memory and maps to clear behaviors.
The classification discipline. Each alert maps to exactly one tier. The mapping is decided once when the alert is created and reviewed when it fires too often. Without explicit classification, alerts drift toward higher urgency over time as engineers mark "this is important" without considering trade-offs.
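If you want the mapping to be explicit rather than tribal, a single reviewed file is enough. A minimal sketch, assuming alerts are identified by name; the alert names and tier assignments below are illustrative, not a real inventory.

```python
from enum import Enum


class Tier(Enum):
    P1 = "page now, any hour"
    P2 = "page during business hours, queue overnight"
    P3 = "ticket for the next morning"


# Decided once when the alert is created; revisited when it fires too often.
ALERT_TIERS: dict[str, Tier] = {
    "checkout_error_rate_high": Tier.P1,  # customer-visible, severe, real-time
    "api_latency_p99_degraded": Tier.P2,  # real problem, not 3am-urgent
    "tls_cert_expires_in_30d":  Tier.P3,  # weeks of buffer
}


def tier_for(alert_name: str) -> Tier:
    """Unknown alerts default to the lowest tier; promote only with justification."""
    return ALERT_TIERS.get(alert_name, Tier.P3)
```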
P1: page now
Customer-visible outage, security incident, data integrity at risk. The pager rings; the on-call is expected to ack within five minutes regardless of time of day. Keep P1s to 2-3 a month per service. Above that, the pager is wearing the team down.
The 2-3 limit's rationale. Sustainable on-call assumes infrequent night-time pages. A team with weekly P1s is in a state where the on-call is regularly disrupted; sleep quality degrades; attrition rises. If P1 frequency exceeds the limit, fix the underlying system reliability OR demote some alerts to P2.
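The budget is easy to check mechanically. A rough sketch, assuming you can export paging history as (service, tier, fired_at) records; the record shape and the threshold of 3 are placeholders for whatever your paging tool actually provides.

```python
from collections import Counter
from datetime import datetime, timedelta

P1_MONTHLY_BUDGET = 3


def over_budget(pages: list[tuple[str, str, datetime]],
                now: datetime | None = None) -> dict[str, int]:
    """Return services whose P1 count in the trailing 30 days exceeds the budget."""
    now = now or datetime.now()
    window_start = now - timedelta(days=30)
    counts = Counter(
        service
        for service, tier, fired_at in pages
        if tier == "P1" and fired_at >= window_start
    )
    return {svc: n for svc, n in counts.items() if n > P1_MONTHLY_BUDGET}
```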
The P1 criteria. Customer-visible (real users hitting the issue). Severe (degraded experience or unavailable). Real-time (mitigation can't wait until morning). All three must be true; alerts that fail any criterion shouldn't be P1.
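The check is a plain conjunction, not a scoring system. A sketch, with assumed field names; the point is that P1 requires all three criteria, not any one of them.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    customer_visible: bool   # real users are hitting it
    severe: bool             # degraded experience or unavailable
    needs_realtime: bool     # mitigation can't wait until morning


def qualifies_as_p1(alert: Alert) -> bool:
    """All three must be true; failing any one means P2 or lower."""
    return alert.customer_visible and alert.severe and alert.needs_realtime
```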
P2: best-effort within hours
Degraded but not down. Page during business hours; queue overnight. The on-call knows about it within hours but does not lose sleep.
The intermediate tier's value. Most alerts that signal real problems aren't 3am-urgent. A storage utilization warning, a capacity threshold, a slow-burning error rate. P2 catches them with fast-enough response without disrupting the team's sleep.
The P2 SLA. Acknowledged within 1 hour during business hours; queued from 18:00 to 08:00 with no expected response. The 1-hour bound is fast enough to catch most issues before they escalate; the overnight queue acknowledges that not every alert needs a response at night.
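The routing rule is one time comparison. A minimal sketch, assuming business hours of 08:00-18:00 local to the on-call; "page" and "queue" stand in for whatever your paging tool does with each outcome.

```python
from datetime import datetime, time

BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def route_p2(fired_at: datetime) -> str:
    """Page during business hours; queue overnight for the next business day."""
    # Weekends and holidays are ignored here; a real policy needs to handle them too.
    if BUSINESS_START <= fired_at.time() < BUSINESS_END:
        return "page"   # ack expected within 1 hour
    return "queue"      # no response expected until 08:00
```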
P3: morning
Internal, non-urgent, but worth the on-call's attention next morning. Queues to a ticket. Most operational alerts belong here; very few belong above.
The P3 use case. Routine operational signals: certificate expiry warnings (with weeks of buffer), capacity trending notifications, low-priority bug reports. Each is worth knowing about; none requires interruption.
The discipline of using P3. Engineers want to feel things are urgent, so they over-classify. Apply the wake-up test: would you sign off on waking someone at 3am for this? If not, it's P3 (or at most P2). The default for an unclear alert should be the lowest tier; promote only with clear justification.
The wake-up test
For any P1 candidate ask: would you, as a manager, sign off on waking up an engineer at 3am to handle this? If you hesitate, it is not P1. The test sounds harsh; it correctly classifies most alerts.
The test's mechanism. The hesitation reveals doubt. Doubt about a P1 alert means the alert is borderline; borderline alerts should default to P2. This filters out alerts that are technically important but not urgent.
The leadership angle. The wake-up test is also useful for justifying alert decisions to leadership. "We made this P2 because we couldn't justify the 3am page" is defensible reasoning; "we made this P1 because someone wanted it" is not.
Escalation path
Primary on-call has 5 minutes to ack. Then secondary. Then manager. Then VP. Each step has a name, not a phone tree. The on-call is not the goalkeeper for the whole org; the escalation is the rest of the org showing up.
The named-step protection. Without explicit escalation, the on-call who's deeply asleep or unreachable is the single point of failure. With escalation, the alert reaches a backup within 5-10 minutes regardless. The escalation chain is what makes on-call coverage real instead of theoretical.
The escalation testing. Once a quarter, send a test page and have the primary deliberately not ack it, so the page escalates past them. Confirm the secondary, then the manager, then the VP all receive it. The test reveals broken contact info, expired phone numbers, and deprecated Slack channels: all the silent failures that would otherwise show up at 3am.
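One way to keep the chain honest is to write it down as named steps with explicit timeouts and walk it in the quarterly test. A sketch under assumptions: send_test_page and confirm_received are hypothetical stand-ins for whatever your paging provider actually exposes, and the rotation addresses are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Step:
    role: str         # "primary", "secondary", "manager", "vp"
    contact: str      # resolved to a person from the current rotation
    ack_minutes: int  # how long before the page moves to the next step


ESCALATION = [
    Step("primary",   "oncall-primary@rotation",   5),
    Step("secondary", "oncall-secondary@rotation", 5),
    Step("manager",   "team-manager@rotation",     10),
    Step("vp",        "org-vp@rotation",           10),
]


def quarterly_escalation_test(send_test_page, confirm_received) -> list[str]:
    """Walk every step (the primary deliberately does not ack) and report failures."""
    failures = []
    for step in ESCALATION:
        send_test_page(step.contact)
        if not confirm_received(step.contact, within_minutes=step.ack_minutes):
            failures.append(f"{step.role}: page never reached {step.contact}")
    return failures
```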
Common antipatterns
"Important" as a tier. Some teams add a tier above P1 for "really important" alerts. Now P1 is treated as second-tier; engineers downgrade their P1 response. Stick with three tiers; if P1 isn't enough urgency for something, the alert is poorly designed.
P1 by service rather than impact. "All alerts on the payments service are P1." Some payments alerts are routine (cert renewal warning weeks out). Classify by impact, not service.
The escalation that requires the original on-call. The secondary has to call the primary to ask "what's going on?" The escalation defeats its own purpose. The escalation chain must be self-sufficient; the secondary should be able to investigate independently.
Tiers without clear handoff between business hours and overnight. P2 alert fires at 17:50; primary on-call ends shift at 18:00. Did they handle it or hand off? Define the handoff window explicitly.
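A handoff window can be made just as explicit as the tiers. A sketch of one possible rule, assuming shifts end at 18:00 and a 30-minute window; both numbers are placeholders for whatever your team agrees on.

```python
from datetime import datetime, time, timedelta

SHIFT_END = time(18, 0)
HANDOFF_WINDOW = timedelta(minutes=30)


def owner_of_p2(fired_at: datetime) -> str:
    """A P2 that fires inside the handoff window belongs to the incoming on-call."""
    shift_end = fired_at.replace(hour=SHIFT_END.hour, minute=SHIFT_END.minute,
                                 second=0, microsecond=0)
    if shift_end - HANDOFF_WINDOW <= fired_at < shift_end:
        return "incoming on-call (queued overnight)"
    return "current on-call"
```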
What to do this week
Three moves. (1) Audit your last 100 alerts. How many were P1? Apply the wake-up test retroactively to each — would you have signed off on a 3am page? Most teams find 30-50% would fail the test. (2) Demote the failing alerts to P2 or P3. The demotion saves the team's sleep without losing visibility. (3) Test the escalation chain this quarter. The test reveals silent failures that real incidents would also reveal but at higher cost.
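The audit in move (1) is a tally, not a project. A sketch, assuming you can export your last 100 pages with their tier and a recorded yes/no answer to the wake-up test; the record format is an assumption, the tally is the point.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    alert_name: str
    tier: str                 # "P1", "P2", "P3"
    passes_wakeup_test: bool  # would you have signed off on a 3am page?


def audit(pages: list[PageRecord]) -> None:
    p1s = [p for p in pages if p.tier == "P1"]
    failing = [p for p in p1s if not p.passes_wakeup_test]
    print(f"{len(p1s)} P1 pages out of {len(pages)}")
    print(f"{len(failing)} P1s fail the wake-up test; candidates for P2/P3:")
    for p in failing:
        print(f"  - {p.alert_name}")
```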