Alert Volume → Burnout Correlation
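Industry surveys and published SRE data show that alert volume correlates with on-call burnout. This note summarizes the data, the mechanisms behind it, what to track, and what to do about it.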
What the data shows
The correlation is well documented. PagerDuty’s State of Digital Operations and Catchpoint’s SRE survey show a direct correlation between page volume and on-call attrition: above 5 pages per night, retention drops sharply. Internal Google SRE data published in the SRE Workbook shows that teams above 2 incidents per shift report higher burnout scores. Burnout is also a lagging indicator: by the time an engineer quits, the team has typically tolerated noisy alerting for 6-12 months.
- 5+ pages per night drops retention. PagerDuty and Catchpoint surveys agree; the threshold is real.
- 2+ incidents per shift correlates with burnout. Google SRE Workbook data; treat 2 as the upper limit.
- Burnout is lagging. Engineers quit 6-12 months after the team starts tolerating noisy alerting.
- Per-rotation health metric. Track each rotation separately; because burnout lags, this is what makes early intervention possible.
The mechanism
Three mechanisms drive the correlation. Sleep interruption is dominant: a single overnight page reduces next-day cognitive performance by 20%, and three pages destroy the day. Cognitive load compounds: engineers who were paged overnight cannot deep-focus the following day, which delays platform work, which in turn produces more pages. Predictability matters more than volume: 10 expected pages on a Tuesday are less damaging than 3 random pages spread over a week.
- Sleep interruption dominant. One page = 20% next-day cognitive hit; three destroys the day.
- Cognitive load compounds. Engineers paged overnight can’t deep-focus; delayed platform work produces more pages.
- Predictability over volume. 10 expected pages on a Tuesday beat 3 random pages over a week.
- Per-shift health awareness. Build predictability into rotation design so each shift’s load is known in advance.
What to track
Three metrics surface burnout risk. The first is pages per shift, measured at p95 across the rotation (target: under 2 per night). The second is after-hours pages as a ratio of business-hours pages; a ratio above 30% suggests poor batching or genuine underinvestment in resilience. The third is a quarterly on-call survey with two questions, “Was your sleep affected?” and “Could you have prevented the page?”; bin the answers and trend them quarter over quarter.
- Pages per shift p95. Target under 2 per night; the headline metric.
- After-hours ratio. Above 30% means poor batching or underinvestment in resilience.
- Quarterly survey, two questions. Was sleep affected; was the page preventable; bin and trend the answers.
- Per-quarter trend dashboard. Keep it visible to the team so the numbers stay on the agenda.
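The first two metrics can be computed from a flat export of pages. The following is a minimal sketch, not a definitive implementation: the record shape `(rotation, shift_id, timestamp)`, the function names, and the 09:00-18:00 weekday definition of business hours are all assumptions made for illustration. It reads "after-hours ratio" literally as after-hours pages divided by business-hours pages, and it takes the full list of shifts as input so that quiet shifts count as zero rather than being dropped.

```python
from collections import Counter
from datetime import datetime
from statistics import median, quantiles


def is_after_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed definition: weekends, or outside start_hour..end_hour local time."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def p95(counts: list[int]) -> float:
    """95th percentile of pages per shift (linear interpolation, inclusive method)."""
    if len(counts) == 1:
        return float(counts[0])
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(counts, n=20, method="inclusive")[-1]


def rotation_metrics(pages, shifts):
    """pages: (rotation, shift_id, datetime) tuples.
    shifts: every (rotation, shift_id) in the period, including shifts with no pages.
    """
    per_shift = Counter((rot, shift) for rot, shift, _ in pages)
    metrics = {}
    for rot in sorted({r for r, _ in shifts}):
        counts = [per_shift[(r, s)] for r, s in shifts if r == rot]
        stamps = [ts for r, _, ts in pages if r == rot]
        after = sum(is_after_hours(ts) for ts in stamps)
        business = len(stamps) - after
        metrics[rot] = {
            "median": median(counts),
            "p95": p95(counts),
            # None when there were no business-hours pages to divide by.
            "after_hours_ratio": after / business if business else None,
        }
    return metrics


# Hypothetical sample data: one rotation, four shifts, two of them quiet.
shifts = [("payments", s) for s in ("s1", "s2", "s3", "s4")]
pages = [
    ("payments", "s1", datetime(2024, 3, 5, 2, 15)),   # Tuesday, overnight
    ("payments", "s1", datetime(2024, 3, 5, 14, 0)),   # Tuesday, business hours
    ("payments", "s1", datetime(2024, 3, 5, 15, 0)),   # Tuesday, business hours
    ("payments", "s2", datetime(2024, 3, 6, 10, 30)),  # Wednesday, business hours
]
print(rotation_metrics(pages, shifts))
```

Timestamps should be in the responder’s local time before they reach `is_after_hours`; a page at 14:00 UTC is an overnight page for someone in a different time zone.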
Interventions that work
Three interventions reliably reduce burnout. Run quarterly noise audits with a fixed budget for tuning, so that tuning is a tracked metric rather than background work. Pay for on-call: even modest stipends signal that the engineer’s time is valued, and the cultural effect is larger than the dollar amount. Spread the load: two-week rotations are easier to absorb than week-long ones, and a primary plus secondary plus manager-on-call structure halves the load on each.
- Quarterly noise audits. Fixed tuning budget; tracked metric, not background work.
- Pay for on-call. Modest stipends signal value; cultural effect larger than dollars.
- Spread the load. Two-week rotations; primary plus secondary plus manager-on-call.
- Per-intervention measurement. Track the effect of each intervention; measured results justify continued investment.
Apply this quarter
The application is concrete. Pull the last quarter of pages from PagerDuty and compute the median and p95 pages per shift and the after-hours ratio for each rotation. If any rotation exceeds 5 pages per shift at p95, freeze feature work on the owning service until the noise budget is back under target. Run the burnout survey now: the first run sets a baseline, and subsequent runs measure whether the interventions are working.
- Quarter of pages from PagerDuty. Median, p95, after-hours ratio per rotation.
- p95 above 5 pages per shift freezes feature work. The freeze holds until the noise budget is back under target.
- Burnout survey baseline. The first run sets the baseline; subsequent runs measure intervention effectiveness.
- Per-quarter intervention cycle. Repeat the measure-and-tune loop every quarter so the reduction continues.
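The freeze rule and the survey comparison above can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed process: the function names and the input dictionaries are hypothetical, "exceeds 5" is read as strictly greater than 5, and "back under target" is assumed to mean the under-2 p95 target from the tracking section.

```python
FREEZE_P95 = 5.0  # from the text: freeze above 5 pages per shift at p95
TARGET_P95 = 2.0  # assumption: "under target" means the under-2 p95 target


def rotations_to_freeze(p95_by_rotation: dict[str, float]) -> list[str]:
    """Rotations whose p95 pages per shift exceeds the freeze threshold."""
    return sorted(r for r, v in p95_by_rotation.items() if v > FREEZE_P95)


def can_unfreeze(p95_value: float) -> bool:
    """A frozen service resumes feature work once p95 is back under target."""
    return p95_value < TARGET_P95


def yes_share(answers: list[bool]) -> float:
    """Bin one survey question into the share of 'yes' answers."""
    return sum(answers) / len(answers) if answers else 0.0


def survey_delta(baseline: dict[str, list[bool]],
                 current: dict[str, list[bool]]) -> dict[str, float]:
    """Change in 'yes' share per question versus the baseline run."""
    return {q: yes_share(current[q]) - yes_share(baseline[q]) for q in baseline}


# Hypothetical numbers for three rotations and two survey runs.
print(rotations_to_freeze({"payments": 6.2, "search": 1.4, "auth": 5.0}))
print(survey_delta(
    {"sleep_affected": [True, True, False, True]},
    {"sleep_affected": [True, False, False, False]},
))
```

A negative delta on "Was your sleep affected?" means fewer respondents reported disrupted sleep than in the baseline run, which is the direction the interventions are meant to push.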