Alert Quality Survey

Periodically survey on-call engineers about alert quality.

Why survey on-call

Numeric metrics (page count, MTTA, MTTR) miss the qualitative signal. “This page was correct but pointless” doesn’t show up in any dashboard. On-call engineers are the closest observers of alert quality, and a 5-minute survey at the end of each rotation produces more actionable signal than a quarter’s worth of automated metrics.

What to ask

Five questions, no more: “Did any page wake you up unnecessarily? Which alert is the worst? Which alert was missing? Did your runbooks work? Was your secondary supportive?” Where a question can be rated, collect a 1-5 score, because trends matter more than individual values. Add one free-text follow-up, “What’s one alert you would delete?”, which usually finds a clear winner within 3 rotations.

How to run it

Keep the survey lightweight to run. Trigger it automatically at the end of each shift via a PagerDuty webhook into a Google Form or Notion DB. Anonymise responses by default, because engineers will only say “this team’s alerts are noise” if the answer cannot be traced back to them. Aggregate weekly and share the rolling 4-week trend in the on-call retro.
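The weekly aggregation and rolling 4-week trend can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: it assumes responses have already been exported as (date, score) pairs with no respondent identifier, and the function names `weekly_means` and `rolling_trend` are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_means(responses):
    """Average 1-5 scores per ISO week.

    `responses` is an iterable of (date, score) pairs. No respondent
    identifier is stored, which keeps the aggregate anonymous.
    """
    buckets = defaultdict(list)
    for day, score in responses:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(score)  # key: (ISO year, ISO week)
    return {week: mean(scores) for week, scores in sorted(buckets.items())}


def rolling_trend(weekly, window=4):
    """Mean of the last `window` weeks that have data, for each week."""
    weeks = list(weekly)
    values = [weekly[w] for w in weeks]
    trend = {}
    for i, week in enumerate(weeks):
        recent = values[max(0, i - window + 1): i + 1]
        trend[week] = mean(recent)
    return trend


# Example data (made up for illustration).
responses = [
    (date(2024, 1, 1), 2),
    (date(2024, 1, 3), 4),
    (date(2024, 1, 8), 3),
    (date(2024, 1, 15), 5),
    (date(2024, 1, 22), 1),
    (date(2024, 1, 29), 3),
]
weekly = weekly_means(responses)
trend = rolling_trend(weekly)
```

Weeks with no responses are skipped rather than counted as zero, so a quiet week does not drag the trend down. Whether that is the right choice depends on how often shifts go unanswered.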

Acting on results

Surveys without action breed cynicism. Tie survey results to a tuning budget so each shift’s worst alert gets 4 engineering hours of investigation that quarter. Review the delete-list responses, because the most-mentioned alerts are usually safe to delete. Don’t let surveys become paperwork: response rate drops to zero in 6 weeks if results don’t drive action.
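Reviewing the delete-list responses amounts to counting mentions. The sketch below assumes the free-text answers are available as a list of strings and that respondents type alert names roughly consistently. The function name `delete_candidates`, the `min_mentions` threshold, and the alert names are all hypothetical.

```python
from collections import Counter


def delete_candidates(free_text_answers, min_mentions=2):
    """Rank alerts by how often they were named as 'the one to delete'.

    Answers are normalised (trimmed, lower-cased) so that minor typing
    differences do not split the count. Blank answers are ignored.
    """
    counts = Counter(
        answer.strip().lower()
        for answer in free_text_answers
        if answer and answer.strip()
    )
    return [(alert, n) for alert, n in counts.most_common() if n >= min_mentions]


# Example answers (made up for illustration).
answers = ["DiskUsageHigh", "diskusagehigh ", "", "CPUHigh", "DiskUsageHigh"]
candidates = delete_candidates(answers)
```

An alert that clears the threshold is a candidate for investigation, not an automatic deletion. The count only tells you where to spend the tuning budget first.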

Get started

Start small. Run the survey for one rotation cycle (2 weeks) without waiting for tooling. Pick one action per response cycle, because small visible improvements signal that the survey is worth filling out. Publish the trend monthly, because the visibility itself drives improvement.