The Paging Policy That Respects Sleep
Wake the on-call only when you mean it. The four rules that prevent the page-everything-just-in-case habit.
The four rules
Rule 1: only page on customer impact. Internal warnings go to a dashboard, not a phone.
Rule 2: only page when the on-call can act. If they cannot fix it tonight, the page is theatre.
Rule 3: aggregate before paging. Fifty alerts on the same incident should produce one page, not fifty.
Rule 4: every page must have a runbook. No runbook means the page is too vague to act on.
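The four rules can be read as a filter followed by a de-duplication step. The sketch below is one possible encoding, not a reference implementation: the `Alert` fields, the `incident_key` grouping key, and the function names are all hypothetical, and a real alerting system would derive these values from its own labels and routing configuration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alert:
    # All fields are assumptions made for this sketch.
    incident_key: str           # groups alerts that belong to one incident
    customer_impact: bool       # Rule 1: are customers affected?
    actionable_now: bool        # Rule 2: can the on-call fix it tonight?
    runbook_url: Optional[str]  # Rule 4: where the response steps live


def should_page(alert: Alert) -> bool:
    """Apply Rules 1, 2 and 4 to a single alert."""
    return (
        alert.customer_impact
        and alert.actionable_now
        and bool(alert.runbook_url)
    )


def pages_to_send(alerts: Iterable[Alert]) -> list[Alert]:
    """Apply Rule 3: keep one qualifying alert per incident."""
    first_per_incident: dict[str, Alert] = {}
    for alert in alerts:
        if should_page(alert):
            # setdefault keeps the first alert seen for each incident key.
            first_per_incident.setdefault(alert.incident_key, alert)
    return list(first_per_incident.values())
```

With this shape, fifty qualifying alerts that share an `incident_key` collapse into a single page, while an internal warning or an alert without a runbook produces none and would go to the dashboard instead.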
Audit the page log
Weekly review: of last week's pages, what share were real incidents?
Anything below 70% is a problem. The team is being woken up for noise.
Tighten rules until the rate is at least 70%. Loose paging erodes trust and burns out engineers.
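The weekly check is a single ratio compared against the 70% threshold. A minimal sketch, assuming the page log has already been reduced to one boolean per page (True for a real incident); the function names and the handling of an empty week are choices made for this example only.

```python
from __future__ import annotations

from typing import Optional, Sequence

TARGET_RATE = 0.70  # the threshold named in the policy


def real_incident_rate(was_real: Sequence[bool]) -> Optional[float]:
    """Share of pages that were real incidents, or None if nobody was paged."""
    if not was_real:
        return None
    return sum(was_real) / len(was_real)


def needs_tightening(was_real: Sequence[bool], target: float = TARGET_RATE) -> bool:
    """True when the week's rate falls below the target."""
    rate = real_incident_rate(was_real)
    # Assumption: a week with no pages gives nothing to tune.
    return rate is not None and rate < target
```

For example, a week with 20 pages of which 12 were real gives a rate of 0.6, so the rules need tightening; 15 real pages out of 20 gives 0.75 and passes.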
Escape valves
Some pages will be wrong. The policy aims for a good average, not a perfect record.
When a page fires for noise, the on-call documents it. The note becomes the input to next week's tuning.
Tuning is everyone's job, not just the SRE team's. Accountability for paging quality is shared.
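One way to turn the on-call's noise notes into tuning input is to count them per alert and look at the worst offenders first. This is only a sketch: the `NoiseNote` record, its fields, and the alert names in the usage example are hypothetical, and the notes could equally live in a ticket tracker or a shared document.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NoiseNote:
    # Hypothetical record written by the on-call after a noisy page.
    alert_name: str
    reason: str


def tuning_candidates(notes: Iterable[NoiseNote], top_n: int = 3) -> list[tuple[str, int]]:
    """Return the alerts that produced the most noise notes, noisiest first."""
    counts = Counter(note.alert_name for note in notes)
    return counts.most_common(top_n)


notes = [
    NoiseNote("disk-usage-warn", "cleared on its own"),
    NoiseNote("disk-usage-warn", "no customer impact"),
    NoiseNote("cert-expiry-30d", "not urgent at 3 a.m."),
]
candidates = tuning_candidates(notes)
```

Here `candidates` lists `disk-usage-warn` first with two notes, which makes it the first alert to revisit in the next weekly review.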