Alerts Practical
By Samson Tanimawo, PhD
Published Sep 23, 2025 · 4 min read

Alert Clarity Test

Test alert text against the 'phone test.'

The phone test

Read the alert text aloud as if it just woke you up at 2am. If the next action isn't obvious in 30 seconds, the alert fails.

Bad: HighCPU on host i-0a3f.
Good: API checkout p99 above 500ms for 10m, runbook at /runbooks/checkout-latency, dashboard at /d/checkout.
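Here is what the good version can look like as a Prometheus rule. The metric name, threshold, and query shape are assumptions for illustration; only the annotation text comes from the example above.

```yaml
# Sketch of the "good" alert above as a Prometheus rule.
# Metric name and threshold are invented for illustration.
groups:
  - name: checkout
    rules:
      - alert: CheckoutAPILatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          service: checkout
          severity: sev1
        annotations:
          summary: "API checkout p99 above 500ms for 10m"
          runbook: "/runbooks/checkout-latency"
          dashboard: "/d/checkout"
```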

Every alert should pass before it ships. Make this part of the PR review for any new rule.

Required fields

Service name, environment, severity, what failed, and what to check first. Five fields, every alert.

Runbook URL is mandatory, not optional. A page without a runbook is a page that wakes someone up to read code at 2am.

Dashboard link with the time range pre-set to the alert window. PromQL link if applicable.
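One way to keep the five fields and both links consistent is a fixed skeleton that every rule must fill in. The field names below are a team convention shown for illustration, not something Prometheus enforces.

```yaml
# Labels/annotations skeleton covering the five required fields plus links.
# Names are a convention, not enforced by Prometheus.
labels:
  service: checkout          # service name
  env: production            # environment
  severity: sev1             # must match the receiver it routes to
annotations:
  summary: "What failed, in plain language"
  triage: "What to check first"
  runbook: "https://wiki.example.com/runbooks/checkout-latency"    # mandatory
  dashboard: "https://grafana.example.com/d/checkout?from=now-1h"  # time range pre-set
```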

Common failures

Alert names that contain only metric paths. The reader cannot infer impact from kube_pod_container_status_waiting_reason.
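A rename is usually the whole fix. Both names below are invented for illustration:

```yaml
# Bad: the name is a metric path; impact is invisible.
- alert: kube_pod_container_status_waiting_reason
# Good: the name states the service and the impact.
- alert: CheckoutPodsStuckWaiting
```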

Generic descriptions copied across 40 rules. If the description says 'something went wrong with the system,' it provides nothing.

Severity labels that don't match the receiver. A sev1 routed to Slack is a documentation bug.
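A routing tree that derives the receiver from the severity label makes that mismatch visible in review. The receiver names and key file below are assumptions in this Alertmanager sketch.

```yaml
# Alertmanager routing sketch: severity picks the receiver, so a sev1
# that would land in Slack shows up as a config mismatch, not a surprise.
route:
  receiver: slack-default
  routes:
    - matchers:
        - severity = "sev1"
      receiver: pagerduty-oncall
    - matchers:
        - severity = "sev2"
      receiver: slack-urgent
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/pagerduty-key
  - name: slack-urgent
    slack_configs:
      - channel: "#alerts-urgent"
  - name: slack-default
    slack_configs:
      - channel: "#alerts"
```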

Templates that work

Use Prometheus templating: {{ $labels.service }} {{ $labels.env }} burn rate {{ $value | humanize }}. The evaluation window is fixed per rule, so write it into the annotation text literally; rule annotations only expose $labels, $externalLabels, and $value. The runbook URL goes in annotations.
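Put together, a burn-rate rule with those templated annotations can look like this. The metric names, the 5% error budget, and the 1h window are assumptions.

```yaml
# Burn-rate rule with templated annotations. Metrics, budget (5%), and
# the 1h window are invented for illustration.
- alert: SLOBurnRateHigh
  expr: |
    sum(rate(requests_errors_total[1h])) by (service, env)
      / sum(rate(requests_total[1h])) by (service, env)
      > (14.4 * 0.05)
  for: 5m
  labels:
    severity: sev1
  annotations:
    summary: >-
      {{ $labels.service }} {{ $labels.env }} burn rate
      {{ $value | humanize }} over 1h
    runbook: "https://wiki.example.com/runbooks/slo-burn"
```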

PagerDuty custom details should mirror the dashboard. Responders should not need to translate.
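In Alertmanager, pagerduty_configs accepts a details map of templated fields, which is one place to do that mirroring. The label and annotation names below assume the skeleton from earlier.

```yaml
# Forward the same fields the dashboard shows into PagerDuty custom
# details, so responders see identical names and numbers.
pagerduty_configs:
  - routing_key_file: /etc/alertmanager/pagerduty-key
    details:
      service: '{{ .CommonLabels.service }}'
      env: '{{ .CommonLabels.env }}'
      summary: '{{ .CommonAnnotations.summary }}'
      dashboard: '{{ .CommonAnnotations.dashboard }}'
      runbook: '{{ .CommonAnnotations.runbook }}'
```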

Test rendering with promtool test rules (unit tests can assert the expanded annotations via exp_annotations) or Datadog's notification preview before merging.
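A unit test for the burn-rate rule above might look like this; run it with promtool test rules test.yml. The file name and series values are invented, chosen so the error ratio comes out to exactly 0.8.

```yaml
# promtool unit test asserting that SLOBurnRateHigh fires and that its
# annotations render as expected.
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'requests_errors_total{service="checkout",env="prod"}'
        values: '0+80x60'    # 80 errors per minute
      - series: 'requests_total{service="checkout",env="prod"}'
        values: '0+100x60'   # 100 requests per minute, ratio = 0.8
    alert_rule_test:
      - eval_time: 30m
        alertname: SLOBurnRateHigh
        exp_alerts:
          - exp_labels:
              service: checkout
              env: prod
              severity: sev1
            exp_annotations:
              summary: 'checkout prod burn rate 800m over 1h'
              runbook: 'https://wiki.example.com/runbooks/slo-burn'
```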

Audit the existing catalog

Pick 10 random alerts that fired in the last 30 days and read each one cold, as if it were 2am. Score each as pass or fail.

If the failure rate is above 30%, freeze new alert creation until the existing catalog is rewritten.

Skip the test only for low-tier notifications that go to email. Those still benefit, but the test shouldn't block them.