Alert Language Clarity
Alerts are read under stress, so clarity matters.
The alert title is the first message
The alert title is the first message the responder reads. Title format: [severity] [service] [symptom]. Example: [SEV2] checkout-api: error rate > 2% for 5m. Avoid internal jargon and team-specific abbreviations, because the alert may page someone outside the owning team. Keep the title under 80 characters, because mobile push notifications truncate and the symptom must appear before the cutoff.
- Title format. [severity] [service] [symptom]; the standard structure.
- No jargon, no abbreviations. Alert may page someone outside the owning team; readability matters.
- Under 80 characters. Mobile push truncates; the symptom must come before the truncation.
- Per-alert title lint. CI checks the format; supports consistent application.
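The title rules above can be sketched as a small lint function. This is a minimal illustration, not the actual CI check: the regular expression assumes severities are written SEV1 through SEV4 and that the service is a lowercase hyphenated name followed by a colon, as in the example title. The names `TITLE_PATTERN` and `lint_title` are hypothetical.

```python
import re

# Assumed conventions (not stated in the standard itself): severity is
# SEV1..SEV4 in brackets, the service is a lowercase hyphenated name
# followed by a colon, and everything after that is the symptom.
TITLE_PATTERN = re.compile(r"^\[SEV[1-4]\] [a-z0-9][a-z0-9-]*: \S.*$")
MAX_TITLE_LENGTH = 80


def lint_title(title: str) -> list:
    """Return a list of problems; an empty list means the title passes."""
    problems = []
    if not TITLE_PATTERN.match(title):
        problems.append("title must look like '[SEVn] service-name: symptom'")
    if len(title) >= MAX_TITLE_LENGTH:
        problems.append(
            f"title is {len(title)} characters; keep it under {MAX_TITLE_LENGTH}"
        )
    return problems


if __name__ == "__main__":
    for candidate in (
        "[SEV2] checkout-api: error rate > 2% for 5m",
        "checkout errors are high",
    ):
        print(candidate, "->", lint_title(candidate) or "ok")
```

A regex cannot detect jargon or abbreviations, so that rule still needs a human reviewer.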
The summary describes the impact
The summary translates the technical condition into a user-impact statement. Write one sentence such as “customer-facing checkouts are failing for X% of users in Y region”, not “PromQL expression XYZ exceeded 0.02”; the on-call needs to be able to tell within 5 seconds whether it is a sev1. If you cannot describe the impact in plain English, the alert is mis-scoped and should move off paging.
- One-sentence impact. “Checkouts failing for X% in Y region”; the user-facing translation.
- Not technical condition. “PromQL exceeded 0.02” is wrong; the user impact is what matters.
- 5-second sev1 decision. The on-call decides urgency from the summary alone.
- Plain-English test. If you can’t describe impact in plain English, move the alert off paging.
The runbook link is mandatory
The runbook link is the first link in every alert, with an action-oriented title such as “Runbook: how to respond to checkout-api error rate”. The runbook opens with how to confirm the alert is real, the first 3 commands to run, and the escalation path. If the runbook is a wiki page that has not been edited in 12 months, the alert is undeployable: it stays blocked until the runbook is current.
- First link in every alert. Titled with action-oriented description.
- Three opening sections. Confirm real, first 3 commands, escalation path.
- 12-month staleness blocks alert. Untouched runbook makes the alert undeployable.
- Per-alert runbook freshness check. CI verifies the runbook age; supports continued accuracy.
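The freshness rule can be sketched as a date comparison. This is a minimal sketch under two assumptions: "12 months" is approximated as 365 days, and the CI job has already obtained the runbook's last-edited date from wherever it lives (wiki API, git history, and so on). The function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "12 months" is approximated as 365 days.
MAX_RUNBOOK_AGE = timedelta(days=365)


def runbook_is_stale(last_edited: date, today: date) -> bool:
    """True when the runbook has gone longer than the allowed age without an edit."""
    return today - last_edited > MAX_RUNBOOK_AGE


def check_alert(alert_name: str, last_edited: date, today: date) -> Optional[str]:
    """Return a message that should block the alert, or None if it passes."""
    if runbook_is_stale(last_edited, today):
        age_days = (today - last_edited).days
        return (
            f"{alert_name}: runbook last edited {age_days} days ago; "
            "update it before deploying the alert"
        )
    return None


if __name__ == "__main__":
    # Hypothetical alert name and dates, for illustration only.
    print(check_alert("checkout-api-error-rate", date(2023, 1, 10), date(2024, 6, 1)))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check deterministic in tests.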
Include the data the responder needs
Include the data, not the abstraction. Link to the dashboard pre-filtered to the right time range, to recent deploys for the affected service, and to the trace search pre-filtered to errors. Don’t dump 50 labels into the alert; pick the 5 that distinguish this alert from the others. Include the PromQL or Datadog query that fired the alert, because the on-call may want to tweak it during triage.
- Pre-filtered dashboard link. Right time range; the on-call clicks once.
- Recent deploys plus trace search. Both pre-filtered for the affected service.
- 5 distinguishing labels. Don’t dump 50; pick the ones that distinguish this alert.
- Inline query. PromQL or Datadog; on-call tweaks during triage.
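As an illustration of the items above, the following sketch assembles the pre-filtered links, the trimmed label set, and the firing query into one context object. Everything concrete in it is an assumption: the example.com URLs, the query-parameter names, the 30-minute lookback window, and the choice of five labels are placeholders to replace with whatever your dashboard, deploy, and tracing tools actually accept.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Hypothetical base URLs and label names; substitute your own tooling.
DASHBOARD_URL = "https://dashboards.example.com/d/service-overview"
DEPLOYS_URL = "https://deploys.example.com/history"
TRACES_URL = "https://traces.example.com/search"
DISTINGUISHING_LABELS = ("service", "region", "env", "severity", "endpoint")


def build_context(labels: dict, fired_at: datetime, query: str) -> dict:
    """Assemble the links, labels, and query a responder needs for one alert."""
    # Assumed window: the 30 minutes before the alert fired, in epoch milliseconds.
    start_ms = int((fired_at - timedelta(minutes=30)).timestamp() * 1000)
    end_ms = int(fired_at.timestamp() * 1000)
    base = {"service": labels["service"], "from": start_ms, "to": end_ms}
    trace_params = dict(base, status="error")
    return {
        "dashboard": DASHBOARD_URL + "?" + urlencode(base),
        "deploys": DEPLOYS_URL + "?" + urlencode(base),
        "traces": TRACES_URL + "?" + urlencode(trace_params),
        # Keep only the labels that distinguish this alert; drop the rest.
        "labels": {k: labels[k] for k in DISTINGUISHING_LABELS if k in labels},
        # The query that fired the alert, so the on-call can tweak it.
        "query": query,
    }


if __name__ == "__main__":
    context = build_context(
        {"service": "checkout-api", "region": "eu-west-1", "pod": "checkout-7f9c"},
        fired_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
        query="<the PromQL or Datadog query that fired>",
    )
    for key, value in context.items():
        print(key, "=", value)
```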
How to enforce the standard
Three mechanisms enforce clarity: a template in the alerting repo (every alert PR uses the template; no template, no merge); a linter that checks for title format, runbook URL present, dashboard link present, and summary length; and a quarterly audit that pulls 10 random alerts, scores them on clarity, and publishes the worst three, with the fixes scheduled for the next sprint.
- Repo template. Every alert PR uses the template; no template, no merge.
- Linter checks. Title format, runbook URL, dashboard link, summary length.
- Quarterly audit. 10 random alerts scored; worst three published with fix in next sprint.
- Per-quarter clarity trend. Documented improvement; supports continued investment.
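The quarterly audit's sampling step can be sketched as follows. This is an assumption-laden illustration: the scoring function stands in for whatever rubric the reviewers use, higher scores are assumed to mean clearer alerts, and the `clarity` field in the usage example is invented for the demo.

```python
import random


def quarterly_audit(alerts, score, sample_size=10, publish_count=3, seed=None):
    """Sample alerts at random, score them, and return the lowest-scoring ones.

    `alerts` is a list of alert definitions and `score` is a callable that
    maps one alert to a clarity score (assumed: higher means clearer).
    """
    rng = random.Random(seed)  # a fixed seed makes an audit reproducible
    sample = rng.sample(alerts, min(sample_size, len(alerts)))
    ranked = sorted(sample, key=score)  # least clear first
    return ranked[:publish_count]


if __name__ == "__main__":
    # Hypothetical alert records with a pre-computed clarity score.
    alerts = [{"name": f"alert-{i}", "clarity": i % 7} for i in range(40)]
    for alert in quarterly_audit(alerts, score=lambda a: a["clarity"], seed=2024):
        print(alert["name"], alert["clarity"])
```

Recording each quarter's scores for the sampled alerts gives the raw data for the clarity trend.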