Make Customer Impact Explicit in Alerts
Alerts should state customer impact plainly.
Why state impact explicitly
An alert that says "high CPU on api-gateway-3" leaves the on-call to derive impact. An alert that says "checkout API p99 1.2s, SLO 200ms, 4% of users affected" is actionable in seconds.
Triage time drops by half when the customer impact is in the alert payload. The on-call moves from "is this real?" straight to "how bad?".
Stakeholder communication becomes trivial. The on-call can copy the impact line into Slack and skip the translation step.
How to compute impact
For latency alerts, include the affected endpoint, the p99 value, the SLO, and the percentage of requests over budget. Source the numbers from Prometheus or Datadog APM.
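A minimal Prometheus sketch of the idea, assuming a standard request-duration histogram called `http_request_duration_seconds` with a `route` label (both are placeholders for whatever your services actually export):

```yaml
groups:
  - name: checkout-latency
    rules:
      - alert: CheckoutLatencySLOBreach
        # p99 over the last 5 minutes for the checkout route; 0.2s is the SLO.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)
          ) > 0.2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout API p99 {{ $value | humanizeDuration }}, SLO 200ms"
          # The percent-of-requests-over-budget figure is easiest to surface via a
          # recording rule (e.g. a hypothetical checkout:slow_request_ratio) quoted here.
          impact: "Service: checkout, Impact: p99 {{ $value | humanizeDuration }} vs 200ms SLO"
```

The threshold lives in `expr`; everything the on-call should read lives in `annotations`.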
For error alerts, include error rate, baseline error rate, and the customer-facing surface ("5% of all logins failing").
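The same pattern for error rate, again with assumed metric names (`http_requests_total` with `route` and `code` labels) and an assumed 0.1% baseline:

```yaml
- alert: LoginErrorRateHigh
  # Ratio of 5xx login responses to all login responses over 5 minutes.
  expr: |
    sum(rate(http_requests_total{route="/login", code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{route="/login"}[5m])) > 0.05
  for: 5m
  labels:
    severity: page
  annotations:
    impact: "Service: auth, Impact: {{ $value | humanizePercentage }} of all logins failing (baseline ~0.1%), Symptom: 5xx on POST /login"
```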
For capacity alerts, include time-to-exhaustion. "Disk 85% full, 4 days at current rate" beats "disk 85% full".
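For time-to-exhaustion, PromQL's `predict_linear` does the extrapolation. A sketch against node_exporter's filesystem metrics, with the 4-day horizon and the `/data` mountpoint as assumptions:

```yaml
- alert: DiskWillFillWithin4Days
  # Fires when the linear trend over the last 6h predicts zero free space
  # within 4 days and the disk is already over 85% used.
  expr: |
    predict_linear(node_filesystem_avail_bytes{mountpoint="/data"}[6h], 4 * 24 * 3600) < 0
      and
    node_filesystem_avail_bytes{mountpoint="/data"}
      / node_filesystem_size_bytes{mountpoint="/data"} < 0.15
  for: 30m
  labels:
    severity: ticket
  annotations:
    impact: "Service: {{ $labels.instance }} /data, Impact: projected full in under 4 days at current write rate, Symptom: over 85% used"
```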
Templating the alert
Use Prometheus alert templates with descriptive annotations. Datadog monitor messages support similar templating with `{{value}}` and tag substitution.
Standardise the impact line format across the org. "Service: X, Impact: Y, Symptom: Z" is enough; variation here costs more than it saves.
Render impact in the same place every time. The eye should land on the customer impact in under 2 seconds.
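A sketch of what standardising on a single annotation key and line format might look like, with `impact` as the assumed org-wide key and the numbers echoing the checkout example above:

```yaml
# Every paging alert carries the same annotation key and the same three fields,
# so the renderer (Slack, PagerDuty, email) puts the impact in the same place every time.
annotations:
  impact: "Service: checkout, Impact: 4% of users see checkout p99 over 1s, Symptom: p99 {{ $value | humanizeDuration }} vs 200ms SLO"
```

On Datadog, the same line goes in the monitor message, with `{{value}}` in place of `{{ $value }}`.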
When to skip impact
Internal-only alerts with no customer surface: background batch jobs, log retention, certificate renewals.
Synthetic alerts that fire before customer impact is measurable. The synthetic itself is the signal; impact is implied.
Even then, the synthetic alert should name the protected user journey. "Synthetic checkout flow failed" is fine; "check failed" is not.
Apply this week
Audit your top 10 paging alerts. Add explicit customer-impact text to each.
Add a code review check: any new alert without an impact field gets bounced.
Measure MTTA over the next month. Explicit impact typically cuts triage time by 30 to 50 seconds per page.