Alert Quality Survey
Survey on-call engineers about alert quality at the end of every rotation.
Why survey on-call
Numeric metrics (page count, MTTA, MTTR) miss the qualitative signal. “This page was correct but pointless” doesn’t show up on any dashboard. On-call engineers are the closest observers of alert quality, and a 5-minute survey at the end of each rotation produces more actionable signal than a quarter’s worth of automated metrics.
- Metrics miss the qualitative signal. Page count, MTTA, and MTTR don’t capture “correct but pointless”.
- On-call engineers are the closest observers. Asking them is cheaper than any tooling investment, and they are the source of the qualitative truth.
- A 5-minute survey beats a quarter of metrics. The survey produces more actionable signal at lower cost.
- Per-rotation cadence. Running the survey at the end of every rotation makes it a recurring habit rather than a one-off.
What to ask
Five scored questions, no more: “Did any page wake you up unnecessarily? Which alert is the worst? Which alert was missing? Did your runbooks work? Was your secondary supportive?” Attach a 1-5 numeric scale to each, because trends matter more than individual values. Add one free-text follow-up, “What’s one alert you would delete?”, which finds a clear winner within 3 rotations.
- Five-question ceiling. The response rate stays high when the survey is short.
- 1-5 numeric scale. Trends matter more than individual values, and a numeric scale can be plotted.
- Free-text delete list. “What’s one alert you would delete?” A clear winner emerges within 3 rotations.
- Stable wording. Keep the questions identical across rotations so that trends stay comparable.
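The question set above could be pinned down in code so that the wording and the 1-5 range cannot drift between rotations. This is a minimal sketch, not a prescribed schema: the names `QUESTIONS`, `FREE_TEXT_PROMPT`, and `SurveyResponse`, and the `rotation` label format, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Fixed wording, taken from the five questions above. Keeping it in one
# constant makes accidental rewording between rotations harder.
QUESTIONS = (
    "Did any page wake you up unnecessarily?",
    "Which alert is the worst?",
    "Which alert was missing?",
    "Did your runbooks work?",
    "Was your secondary supportive?",
)
FREE_TEXT_PROMPT = "What's one alert you would delete?"


@dataclass(frozen=True)
class SurveyResponse:
    """One anonymous response; deliberately has no respondent field."""

    rotation: str              # hypothetical label, e.g. "2024-R07"
    scores: Tuple[int, ...]    # one 1-5 score per entry in QUESTIONS
    delete_candidate: str = "" # answer to FREE_TEXT_PROMPT, may be empty

    def __post_init__(self):
        if len(self.scores) != len(QUESTIONS):
            raise ValueError("expected one score per question")
        if any(not 1 <= s <= 5 for s in self.scores):
            raise ValueError("scores must be on the 1-5 scale")
```

Rejecting out-of-range scores at construction time keeps later trend plots on a single consistent scale.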
How to run it
Keep the operations lightweight. Trigger the survey automatically at the end of each shift via a PagerDuty webhook into a Google Form or Notion DB. Anonymise responses by default, because engineers will be honest about “this team’s alerts are noise” only if their answers cannot be traced back to them. Aggregate weekly and share the rolling 4-week trend in the on-call retro.
- Auto-trigger via webhook. A PagerDuty webhook into a Google Form or Notion DB removes the need for manual reminders.
- Anonymise responses. Engineers are honest about noise only if answers are untraceable, so the survey depends on it.
- Weekly aggregation. Share the rolling 4-week trend in the on-call retro to keep attention on it.
- Team-level visibility. The whole team sees the trend every week, which keeps the qualitative feedback in view.
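The anonymise-then-aggregate steps above can be sketched in a few lines of Python. This assumes responses have already been exported as dictionaries with hypothetical `week`, `score`, and `respondent` fields; the webhook and form integration are out of scope here. Note that the window covers the last four weeks that have data, which matches calendar weeks only if no week is empty.

```python
from collections import defaultdict
from statistics import mean


def anonymise(raw):
    """Keep only the week and score; drop anything identifying the respondent."""
    return (raw["week"], raw["score"])


def weekly_means(pairs):
    """Average the 1-5 scores per week from (week, score) pairs."""
    by_week = defaultdict(list)
    for week, score in pairs:
        by_week[week].append(score)
    return {week: mean(scores) for week, scores in sorted(by_week.items())}


def rolling_trend(weekly, window=4):
    """Mean of the weekly averages over the last `window` weeks with data."""
    weeks = sorted(weekly)
    trend = {}
    for i, week in enumerate(weeks):
        recent = weeks[max(0, i - window + 1): i + 1]
        trend[week] = mean(weekly[w] for w in recent)
    return trend
```

Usage is a straight pipeline: `rolling_trend(weekly_means(anonymise(r) for r in raw_rows))` yields one value per week to plot in the retro.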
Acting on results
Surveys without action breed cynicism. Tie survey results to a tuning budget, so that each shift’s worst alert gets 4 engineering hours of investigation within the quarter. Review the delete-list responses, because the most-mentioned alerts can usually be deleted safely. Don’t let the survey become paperwork: response rate drops to zero in 6 weeks if results don’t drive action.
- Tie to a tuning budget. Each shift’s worst alert gets 4 engineering hours of investigation within the quarter.
- Delete-list review. The most-mentioned alerts can usually be deleted safely; trust the on-call.
- Avoid the paperwork trap. Response rate drops to zero in 6 weeks if results don’t drive action.
- Visible action every cycle. One visible improvement per cycle signals that the survey is worth filling out.
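Finding the most-mentioned alerts in the delete-list answers is a simple tally. The sketch below assumes the free-text answers are available as a list of strings and that trimming whitespace and lowercasing is enough to match alert names; real answers may need fuzzier matching. The function name `delete_list_ranking` is hypothetical.

```python
from collections import Counter


def delete_list_ranking(answers, top=3):
    """Return the most-mentioned alert names as (name, mentions) pairs."""
    # Normalise lightly so "DiskFull " and "diskfull" count as one alert;
    # skip empty or missing answers.
    normalised = (a.strip().lower() for a in answers if a and a.strip())
    return Counter(normalised).most_common(top)
```

Feeding it the answers from the last few rotations gives a short candidate list for the delete-list review.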
Get started
Start small. Run the survey for one rotation cycle (2 weeks) without waiting for tooling. Pick one action per response cycle, because small visible improvements signal that the survey is worth filling out. Publish the trend monthly, because the visibility itself drives improvement.
- One rotation cycle first. Run it for 2 weeks without waiting for tooling; the simplest form works.
- One action per cycle. Small, visible improvements signal that the survey matters.
- Monthly trend publication. Visibility drives improvement; the trend is the lever.
- Quarterly survey review. Review the survey itself for fit each quarter so that it keeps improving.