Alert Fatigue Survey
Surveying on-call about alert quality. Quarterly.
Why a quarterly survey
Metrics tell you alert volume but not which alerts the on-call hates. A 10-question survey to every on-call once a quarter surfaces the qualitative noise: confusing alerts, useless runbooks, alerts that always fire during deploys. Anonymous, 5 minutes to fill out, run the week after each rotation.
- Volume metrics insufficient. They don’t tell you which alerts the on-call hates.
- 10-question quarterly. Surfaces qualitative noise: confusing alerts, useless runbooks, deploy-time fires.
- Anonymous and short. 5 minutes; honest answers depend on anonymity.
- Per-rotation timing. The week after each rotation; recency matters for accuracy.
The questions that matter
Four core questions cover most ground. How many pages required no action (count, not percentage); which 3 alerts were the most painful and why (free text); did any page have a runbook that actually helped (yes/no per page); on a 1-10 scale, how rested are you after this rotation.
- No-action page count. Count, not percentage; the noise volume.
- Top 3 painful alerts. Free text; the qualitative signal that drives action.
- Runbook helpfulness. Yes/no per page; the runbook quality measure.
- Rest score 1-10. Burnout proxy; below 6 is intervention territory.
Act on the results within 30 days
Action within 30 days keeps the survey alive. Publish a public alert-cleanup list within 2 weeks; top 3 painful alerts get retuned or retired by the next quarter (publish the diff); if the rest score is below 6, reduce the rotation size or pull alerts because burnout is not a metrics problem.
- Public cleanup list in 2 weeks. Survey results without action breeds cynicism.
- Top 3 retuned by next quarter. Publish the diff; the action is visible.
- Rest below 6 triggers intervention. Reduce rotation size or pull alerts; burnout is structural.
- Per-cycle action commitment. One named action per cycle; supports follow-through.
Trend, don't snapshot
One survey is a snapshot; four surveys are a trend. Compare quarter over quarter; track median noise pages per shift, painful-alert count, rest score and plot these against alert volume metrics; trend reversal means the cleanup ritual is working while stable or worsening means you need a bigger intervention.
- Snapshot vs trend. One survey is a snapshot; four are a trend; the multi-quarter view matters.
- Track 3 metrics. Median noise pages, painful-alert count, rest score; plot against volume.
- Trend reversal good. Cleanup ritual is working; the leading indicator.
- Per-trend escalation. Stable or worsening means a bigger intervention; the data drives the response.
Who runs the survey
The owner matters for honesty. On-call manager or SRE lead, not the owning team of the noisy service because that creates conflict-of-interest pressure on responses. 5 minutes per person 4 times a year, 20 minutes per engineer per year total; the ROI is alerts retired, rotations rebalanced, attrition reduced.
- Manager or SRE lead. Not the owning team of the noisy service; conflict of interest distorts responses.
- 20 minutes per engineer per year. 5 minutes per person, 4 times a year; the cost is small.
- ROI: alerts retired, rotations rebalanced, attrition reduced. The cheapest signal you can buy.
- Per-org survey owner. Documented ownership; supports consistent execution.