Alert Clarity Test
Test alert text against the 'phone test.'
The phone test
The clarity test is concrete. Read the alert text aloud as if it just woke you up at 2am; if the next action isn’t obvious in 30 seconds, the alert fails. Bad: “HighCPU on host i-0a3f”. Good: “API checkout p99 above 500ms for 10m, runbook at /runbooks/checkout-latency, dashboard at /d/checkout”. Every alert should pass before it ships.
- Read aloud at 2am. The next action must be obvious within 30 seconds; that is the bar.
- Bad example. “HighCPU on host i-0a3f”; reader cannot infer impact or action.
- Good example. Service, what, by how much, runbook, dashboard; complete in one read.
- PR review gate. The test runs as part of every new-alert PR, so no alert ships without passing it.
Required fields
Every alert carries five fields: service name, environment, severity, what failed, and what to check first. On top of those, the runbook URL is mandatory, not optional, because a page without a runbook wakes someone up to read code at 2am. Add a dashboard link with the time range pre-set to the alert window, and a PromQL link if applicable.
- Five-field minimum. Service, env, severity, what failed, what to check first.
- Runbook URL mandatory. Not optional; a page without a runbook wakes someone up to read code.
- Dashboard with pre-set time range. The link opens to the right context.
- PromQL link when applicable. The query inspector pre-loaded for fast investigation.
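The field requirements above lend themselves to a mechanical check. The following is a minimal sketch, assuming each rule is already parsed into a dict with labels and annotations; the key names (first_check, runbook_url, dashboard_url, and so on) are hypothetical and should be swapped for whatever the rule schema actually uses. The PromQL link is left out of the required set because it only applies to some alerts.

```python
from typing import Dict, List

# Hypothetical key names; adjust to the schema your rules actually use.
REQUIRED_LABELS = ("service", "env", "severity")
REQUIRED_ANNOTATIONS = ("summary", "first_check", "runbook_url", "dashboard_url")


def missing_fields(rule: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the required labels/annotations that are absent or blank."""
    missing = []
    for section, names in (("labels", REQUIRED_LABELS),
                           ("annotations", REQUIRED_ANNOTATIONS)):
        values = rule.get(section, {})
        for name in names:
            # Whitespace-only values count as missing.
            if not str(values.get(name, "")).strip():
                missing.append(f"{section}.{name}")
    return missing
```

An empty list means the rule carries every required field; anything else can be printed as the PR failure message.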
Common failures
Three failures show up over and over. Alert names that contain only metric paths (the reader cannot infer impact from kube_pod_container_status_waiting_reason); generic descriptions copied across 40 rules (“something went wrong with the system” provides nothing); severity labels that don’t match the receiver (a sev1 routed to Slack is a documentation bug).
- Metric-path-only names. Reader cannot infer impact; the alert tells the wrong story.
- Copy-paste descriptions. “Something went wrong” provides nothing; same text across 40 rules is noise.
- Severity-receiver mismatch. Sev1 to Slack is a documentation bug; receiver must match urgency.
- Per-failure lint. CI catches the common failures; the discipline lives in the linter.
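A lint for the three failures above could look roughly like this. It is a sketch under stated assumptions, not a finished linter: rules are assumed to be flattened into dicts with name, description, severity, and receiver keys, and the set of receivers that count as paging (PAGING_RECEIVERS) is a placeholder to replace with the real routing tree.

```python
import re
from collections import Counter
from typing import Dict, List

# Assumption: which receivers actually page a human; adjust to your routing.
PAGING_RECEIVERS = {"pagerduty"}
# A name made only of lowercase snake_case tokens looks like a raw metric path.
METRIC_PATH = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")


def lint_rules(rules: List[Dict[str, str]]) -> List[str]:
    """Flag metric-path names, copy-pasted descriptions, and sev1 on non-paging receivers."""
    findings = []
    # Count normalized descriptions so identical text across rules is detectable.
    counts = Counter(r["description"].strip().lower() for r in rules)
    for r in rules:
        if METRIC_PATH.match(r["name"]):
            findings.append(f"{r['name']}: name is a bare metric path")
        if counts[r["description"].strip().lower()] > 1:
            findings.append(f"{r['name']}: description duplicated across rules")
        if r["severity"] == "sev1" and r["receiver"] not in PAGING_RECEIVERS:
            findings.append(f"{r['name']}: sev1 routed to {r['receiver']}")
    return findings
```

Running it in CI and failing the build on a non-empty result keeps the check out of reviewers' heads.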
Templates that work
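Templates make clarity automatic. Use Prometheus templating, for example {{ $labels.service }} {{ $labels.env }} burn rate {{ $value | humanize }} over {{ $labels.window }}, with the runbook URL in annotations. Prometheus alert templates have no built-in window variable, so the window must be hard-coded or carried as a label (window here is an assumed label name). PagerDuty custom details should mirror the dashboard so responders don’t need to translate. Before merging, test rendering with a promtool test rules unit test that asserts the expected annotations, or with Datadog’s preview tab.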
- Prometheus templating. {{ $labels.service }} burn rate {{ $value | humanize }}; supports per-alert variation with a shared shape.
- Runbook URL in annotations. Standard slot; supports automated checks.
- PagerDuty custom details mirror dashboard. Responders don’t need to translate.
- Pre-merge render test. promtool test rules or Datadog preview; catches template breaks before deploy.
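One cheap static check that can sit alongside a real render test is to confirm that every label a template references is one the rule will actually carry. The sketch below assumes templates use the $labels.name form shown above and that the set of labels on the rule is known; it does not execute the template and is not a substitute for rendering it in the real tool.

```python
import re
from typing import List, Set

# Matches $labels.<name> references inside Prometheus-style annotation templates.
LABEL_REF = re.compile(r"\$labels\.([A-Za-z_][A-Za-z0-9_]*)")


def unknown_label_refs(template: str, known_labels: Set[str]) -> List[str]:
    """Return label names the template references that the rule does not carry."""
    referenced = LABEL_REF.findall(template)
    return sorted(set(referenced) - known_labels)
```

A non-empty result usually means the rendered alert would show a blank where the service or environment should be.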
Audit the existing catalog
The audit is concrete. Pick 10 random alerts from the last 30 days and read each one cold, as if it had just woken you at 2am; score each as pass or fail. If the failure rate is above 30%, freeze new alert creation until the existing catalog is rewritten. Skip the test only for low-tier email notifications: they would still benefit from it, but failing it is not blocking.
- 10 random alerts. Last 30 days; read at 2am cold; score pass or fail.
- 30% failure threshold. Above that, freeze new alert creation until catalog is rewritten.
- Skip for low-tier email. Still benefit; not blocking.
- Per-quarter audit cycle. The audit reruns every quarter so clarity does not decay.
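The sampling and threshold arithmetic can be sketched as below. The pass/fail verdict is a human judgment, so it is passed in as a callable; the function name and the returned keys are illustrative choices, and the defaults simply encode the 10-alert sample and 30% threshold described above.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def audit_sample(alerts: Sequence[str],
                 verdict: Callable[[str], bool],
                 sample_size: int = 10,
                 freeze_above: float = 0.30,
                 seed: Optional[int] = None) -> Dict[str, object]:
    """Sample alerts at random, apply the reviewer's verdict, and decide on a freeze."""
    rng = random.Random(seed)
    # Never ask for more items than exist.
    sample = rng.sample(list(alerts), min(sample_size, len(alerts)))
    failed = [a for a in sample if not verdict(a)]
    rate = len(failed) / len(sample) if sample else 0.0
    return {
        "sampled": len(sample),
        "failed": failed,
        "failure_rate": rate,
        # Strictly above the threshold triggers the freeze.
        "freeze": rate > freeze_above,
    }
```

With the defaults, three failures out of ten sit exactly at the threshold and do not trigger a freeze; four do.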