The Paging Policy That Respects Sleep
Wake the on-call only when you mean it. Four rules prevent the page-everything-just-in-case habit.
The four rules
Four rules separate paging that respects sleep from paging that destroys it. Each one is mechanical; the discipline is enforcing them.
- Rule 1: customer impact only. Internal warnings go to a dashboard, not a phone; only customer-facing problems wake humans.
- Rule 2: actionable tonight. Page only when the on-call can act; if the problem cannot be fixed tonight, the page is theatre.
- Rule 3: aggregate before paging. Fifty alerts on the same incident equal one page, not fifty; deduplication is a feature.
- Rule 4: runbook required. Every page links to a runbook; no runbook means the page is too vague to act on.
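The four rules can be read as a single routing decision. The following is a minimal sketch, not a reference implementation: the `Alert` fields, the `incident_key` grouping key, and the `route` function are all hypothetical names chosen for illustration, and sending an alert without a runbook to the dashboard is an assumption of this sketch rather than something the policy spells out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    incident_key: str            # hypothetical grouping key, e.g. service + symptom
    customer_facing: bool        # Rule 1: customer impact only
    actionable_now: bool         # Rule 2: the on-call can act tonight
    runbook_url: Optional[str]   # Rule 4: a page must link to a runbook


def route(alerts):
    """Split alerts into (pages, dashboard), with at most one page per incident."""
    pages = {}
    dashboard = []
    for alert in alerts:
        qualifies = (
            alert.customer_facing
            and alert.actionable_now
            and bool(alert.runbook_url)
        )
        if not qualifies:
            # Fails Rule 1, 2, or 4: visible on the dashboard, nobody is woken.
            dashboard.append(alert)
        elif alert.incident_key not in pages:
            # Rule 3: the first qualifying alert for an incident becomes the page.
            pages[alert.incident_key] = alert
        # Later qualifying alerts for an already-paged incident are suppressed
        # in this sketch; a real system would likely attach them to the page.
    return list(pages.values()), dashboard
```

Under these assumptions, a storm of fifty identical customer-facing alerts produces one page, while an internal warning or an alert with no runbook lands on the dashboard.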
Audit the page log
Without a weekly audit, the rules drift. The page log is the cheapest signal that the discipline is holding.
- Weekly review. Of last week's pages, how many were real incidents that needed human action?
- Threshold. A real-incident rate below 70% is a problem; it means the team is being woken up for noise.
- Tighten until honest. Adjust rules until the rate clears 70%; loose paging erodes trust and burns out engineers.
- Owner. One named engineer runs the audit each week; the discipline lives or dies on the cadence.
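The weekly check is simple arithmetic: real incidents divided by total pages, compared against the 70% threshold. A minimal sketch follows, assuming the page log is a list of dictionaries with a boolean `real_incident` field filled in during the review; that log shape and the function names are assumptions for illustration.

```python
def real_incident_rate(page_log):
    """Return the fraction of pages marked as real incidents, or None if there were no pages."""
    if not page_log:
        return None
    real = sum(1 for page in page_log if page["real_incident"])
    return real / len(page_log)


def audit(page_log, threshold=0.70):
    """Summarise one week of pages against the real-incident threshold."""
    rate = real_incident_rate(page_log)
    if rate is None:
        return "no pages this week"
    # A rate below the threshold means the rules need tightening.
    verdict = "ok" if rate >= threshold else "tighten rules"
    return f"{rate:.0%} real incidents: {verdict}"
```

For example, a week with ten pages of which six were real incidents gives a 60% rate, which falls below the threshold and signals that the rules need tightening.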
Escape valves
The rules set a target that holds on average, not a guarantee for every shift. Mistakes happen; the policy provides escape valves so noise becomes data, not just frustration.
- Wrong pages happen. Some pages will be noise; judge the policy by its average across the quarter, not by any single shift.
- Document the noise. When a page fires for noise, the on-call writes it down; the note feeds next week's tuning.
- Distributed accountability. Tuning is everyone's job, not just SRE's; service teams own their alert hygiene.
- Quarterly retro. Aggregate the noise notes; the patterns drive structural fixes, not per-alert tweaks.
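One way to turn a quarter of noise notes into patterns is to count them by service and cause, then keep only the combinations that recur. This is a sketch under stated assumptions: the `service` and `cause` fields, the `noise_patterns` name, and the default cutoff of three occurrences are hypothetical choices, not part of the policy.

```python
from collections import Counter


def noise_patterns(notes, min_count=3):
    """Group noise notes by (service, cause) and return recurring ones, most frequent first."""
    counts = Counter((note["service"], note["cause"]) for note in notes)
    # Keep only combinations seen at least min_count times; one-offs are
    # per-alert tweaks, recurring ones point at a structural fix.
    return [(key, count) for key, count in counts.most_common() if count >= min_count]
```

A combination that shows up repeatedly, such as the same service paging for the same cause several times in a quarter, is a candidate for a structural fix rather than another threshold adjustment.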