Alert Clarity Test

Test alert text against the 'phone test.'

The phone test

The clarity test is concrete. Read the alert text aloud as if it just woke you up at 2am; if the next action isn’t obvious in 30 seconds, the alert fails. Bad: “HighCPU on host i-0a3f”. Good: “API checkout p99 above 500ms for 10m, runbook at /runbooks/checkout-latency, dashboard at /d/checkout”. Every alert should pass before it ships.

Required fields

Five fields, every alert: service name, environment, severity, what failed, and what to check first. On top of those, the runbook URL is mandatory, not optional, because a page without a runbook wakes someone up to read code at 2am. Add a dashboard link with the time range pre-set to the alert window, and a PromQL link if applicable.
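
The required-field check lends itself to a pre-merge lint. The sketch below is one way to do it, not a prescribed implementation: the field names (service, environment, severity, summary, first_check, runbook_url) are hypothetical and should be mapped to whatever your alert schema actually uses.

```python
from urllib.parse import urlparse

# Hypothetical field names; rename them to match your own alert schema.
REQUIRED_FIELDS = ("service", "environment", "severity", "summary", "first_check")


def missing_fields(alert: dict) -> list[str]:
    """Return the names of required fields that are absent or blank."""
    missing = [f for f in REQUIRED_FIELDS if not str(alert.get(f) or "").strip()]

    # The runbook is mandatory: accept an absolute http(s) URL or a
    # site-relative path such as /runbooks/checkout-latency.
    runbook = str(alert.get("runbook_url") or "").strip()
    parsed = urlparse(runbook)
    is_absolute = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    if not is_absolute and not runbook.startswith("/"):
        missing.append("runbook_url")
    return missing
```

An alert definition passes when missing_fields returns an empty list; anything else names exactly what the author still has to fill in.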

Common failures

Three failures show up over and over. Alert names that contain only metric paths (the reader cannot infer impact from kube_pod_container_status_waiting_reason); generic descriptions copied across 40 rules (“something went wrong with the system” provides nothing); severity labels that don’t match the receiver (a sev1 routed to Slack is a documentation bug).
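
All three failures can be caught mechanically before review. The following is a minimal sketch under stated assumptions: rules are plain dicts with hypothetical keys (name, description, severity, receiver), a "metric path" is taken to mean an all-lowercase snake_case name, and the severity-to-receiver policy in ALLOWED_RECEIVERS is invented for illustration.

```python
import re
from collections import Counter

# Assumption: an all-lowercase snake_case name is a raw metric path.
METRIC_PATH = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")

# Hypothetical routing policy; replace with your own severity contract.
ALLOWED_RECEIVERS = {
    "sev1": {"pagerduty"},
    "sev2": {"pagerduty", "slack"},
    "sev3": {"slack", "email"},
}


def lint_rules(rules: list[dict]) -> list[str]:
    """Return one finding per detected clarity failure."""
    findings = []
    # Count how many rules share each (normalized) description.
    copies = Counter(r.get("description", "").strip().lower() for r in rules)
    for r in rules:
        name = r.get("name", "")
        if METRIC_PATH.match(name):
            findings.append(f"{name}: name is a bare metric path")
        desc = r.get("description", "").strip().lower()
        if desc and copies[desc] > 1:
            findings.append(f"{name}: description is copied across rules")
        severity, receiver = r.get("severity"), r.get("receiver")
        if receiver not in ALLOWED_RECEIVERS.get(severity, set()):
            findings.append(f"{name}: {severity} routed to {receiver}")
    return findings
```

The name heuristic is deliberately crude; it flags kube_pod_container_status_waiting_reason but lets a name like CheckoutLatencyHigh through, which still needs a human to judge whether it conveys impact.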

Templates that work

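Templates make clarity automatic. Use Prometheus templating in the rule’s annotations, for example {{ $labels.service }} {{ $labels.env }} burn rate {{ $value | humanize }}, and put the runbook URL in annotations as well. Prometheus exposes no window variable to alert templates (only $labels, $value, $externalLabels and $externalURL), so write the window into the text or carry it as a label. PagerDuty custom details should mirror the dashboard so responders don’t need to translate. Test rendering before merging: promtool test rules can assert on the rendered annotations, amtool template render covers Alertmanager notification templates, and Datadog has its preview tab.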
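
A template only helps if every label it references actually exists on the alerting series; a missing label renders as an empty string and the alert quietly loses its service or environment. The sketch below is a hypothetical pre-merge check, not part of Prometheus or promtool: it extracts $labels references with a regular expression and compares them to a set of label names you supply.

```python
import re

# Matches label references such as {{ $labels.service }}.
LABEL_REF = re.compile(r"\$labels\.([A-Za-z_][A-Za-z0-9_]*)")


def unknown_labels(template: str, available: set[str]) -> list[str]:
    """Return referenced labels missing from `available`, in order of first use."""
    missing = []
    for name in LABEL_REF.findall(template):
        if name not in available and name not in missing:
            missing.append(name)
    return missing


summary = "{{ $labels.service }} {{ $labels.env }} burn rate {{ $value | humanize }}"
# The expression aggregates by service only, so `env` would render empty.
print(unknown_labels(summary, {"service"}))
```

This does not replace rendering the template with the real tooling; it only catches the most common reason a rendered alert comes out with blanks.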

Audit the existing catalog

The audit is concrete. Pick 10 random alerts from the last 30 days and read them cold, as if at 2am. Score each as pass or fail. If the failure rate is above 30% (four or more of the ten), freeze new alert creation until the existing catalog is rewritten. The only exemption is low-tier email notifications: they still benefit from the test, but a failure there isn’t blocking.