Alert Quality Survey
Periodically survey on-call about alert quality.
Why survey on-call
Numeric metrics (page count, MTTA, MTTR) miss the qualitative signal. "This page was correct but pointless" doesn't show up in any dashboard.
On-call engineers are the closest observers of alert quality. Asking them is cheaper than any tooling investment.
A 5-minute survey at the end of each rotation produces more actionable signal than a quarter of automated metrics.
What to ask
Five questions, no more: "Did any page wake you up unnecessarily? Which alert is the worst? Which alert was missing? Did your runbooks work? Was your secondary supportive?" (A sketch of the survey as a form definition follows this list.)
Use a numeric scale (1 to 5) for each scored question. Trends matter more than individual values.
Free-text follow-up: "What's one alert you would delete?" Most teams find a clear winner within 3 rotations.
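Defining the survey once as data makes it easy to wire into a form and into the weekly aggregation. A minimal sketch in Python: the question text comes from this page, while the SurveyQuestion type, the key names, and the split between scaled and free-text questions are assumptions, not a prescribed schema.

```python
# Minimal sketch: the survey as a single data definition that can drive
# both form generation and score aggregation. Question text is from this
# playbook; everything else (type, keys, kinds) is an assumption.
from dataclasses import dataclass

@dataclass
class SurveyQuestion:
    key: str    # stable identifier used later for aggregation
    text: str   # question as shown to the responder
    kind: str   # "scale" (1 to 5) or "free_text"

SURVEY = [
    SurveyQuestion("unnecessary_wakeups", "Did any page wake you up unnecessarily?", "scale"),
    SurveyQuestion("worst_alert", "Which alert is the worst?", "free_text"),
    SurveyQuestion("missing_alert", "Which alert was missing?", "free_text"),
    SurveyQuestion("runbooks_worked", "Did your runbooks work?", "scale"),
    SurveyQuestion("secondary_supportive", "Was your secondary supportive?", "scale"),
    # The free-text follow-up from above:
    SurveyQuestion("delete_one", "What's one alert you would delete?", "free_text"),
]
```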
How to run it
Trigger the survey automatically at the end of each shift, e.g. a PagerDuty webhook or scheduled job that sends the outgoing on-call a link to a Google Form or Notion database (see the first sketch after this list).
Anonymise responses by default. Engineers will only be honest about "this team's alerts are noise" if responses cannot be traced back to them.
Aggregate weekly. Share the rolling 4-week trend in the on-call retro (one way to compute it is sketched below).
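First, the trigger. PagerDuty's webhooks fire on incident events rather than shift boundaries, so one practical approximation of "end of shift" is a scheduled job against the REST API's /oncalls endpoint. A minimal sketch, assuming a cron job running near rotation boundaries; API_KEY, SCHEDULE_ID, FORM_URL, and notify() are placeholders you would swap for your own values and delivery channel (Slack DM, email).

```python
# Minimal sketch: cron job that finds whose on-call shift just ended via
# the PagerDuty REST API (GET /oncalls) and sends them the survey link.
# API_KEY, SCHEDULE_ID, FORM_URL, and notify() are placeholders.
from datetime import datetime, timedelta, timezone
import requests

API_KEY = "..."            # PagerDuty REST API key (placeholder)
SCHEDULE_ID = "PXXXXXX"    # on-call schedule to survey (placeholder)
FORM_URL = "https://forms.example/alert-quality"  # your survey form

def finished_shift_users():
    """Return users whose shift on SCHEDULE_ID ended in the last hour."""
    now = datetime.now(timezone.utc)
    resp = requests.get(
        "https://api.pagerduty.com/oncalls",
        headers={
            "Authorization": f"Token token={API_KEY}",
            "Accept": "application/vnd.pagerduty+json;version=2",
        },
        params={
            "schedule_ids[]": SCHEDULE_ID,
            "since": (now - timedelta(hours=1)).isoformat(),
            "until": now.isoformat(),
        },
    )
    resp.raise_for_status()
    return [
        oc["user"]
        for oc in resp.json()["oncalls"]
        # Keep only entries whose shift end has already passed.
        if oc.get("end")
        and datetime.fromisoformat(oc["end"].replace("Z", "+00:00")) <= now
    ]

def notify(user, url):
    """Deliver the survey link; left abstract here (Slack DM, email, ...)."""
    print(f"Would send {url} to {user['summary']}")

if __name__ == "__main__":
    for user in finished_shift_users():
        notify(user, FORM_URL)
```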
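Second, the weekly roll-up. A minimal sketch with pandas, assuming anonymised responses land in a responses.csv with week, question_key, and score columns (all assumed names); the rolling 4-week mean per question is the number to put in front of the retro.

```python
# Minimal sketch: mean score per question per week, then the rolling
# 4-week trend. Assumes a responses.csv with columns week, question_key,
# score, already anonymised before this step.
import pandas as pd

responses = pd.read_csv("responses.csv", parse_dates=["week"])

weekly = (
    responses
    .groupby(["week", "question_key"])["score"]
    .mean()
    .unstack("question_key")   # one column per question, indexed by week
    .sort_index()
)

# Rolling 4-week mean per question: the trend to share in the retro.
trend = weekly.rolling(window=4, min_periods=1).mean()
print(trend.tail(4).round(2))
```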
Acting on results
Tie survey results to a tuning budget. Each shift's worst alert gets 4 engineering hours of investigation that quarter.
Review the delete-list responses. The most-mentioned alerts are usually safe to delete; trust the on-call (a tally sketch follows this list).
Don't let the survey become paperwork. If results don't drive action, the response rate drops to zero within 6 weeks.
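Tallying the delete-list is a few lines once the free-text answers have been normalised to alert names. A minimal sketch; the example votes are illustrative, not real data.

```python
# Minimal sketch: count "one alert you would delete" mentions and surface
# the top candidates. Assumes answers are already normalised to alert
# names, one per response; the data below is illustrative only.
from collections import Counter

delete_votes = [
    "HighCPUWarning", "DiskAlmostFullStaging", "HighCPUWarning",
    "HighCPUWarning", "SSLCertExpiry90d",
]

tally = Counter(delete_votes)
for alert, votes in tally.most_common(3):
    print(f"{votes:>2}x  {alert}")
```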
Get started
Run the survey for one rotation cycle (2 weeks). Don't wait for tooling.
Pick one action per response cycle. Small, visible improvements signal the survey is worth filling out.
Publish the trend monthly. The visibility itself drives improvement.