Customer Impact Explicit in Alerts

Alerts should state customer impact plainly.

Why state impact explicitly

An alert that says “high CPU on api-gateway-3” leaves the on-call engineer to derive impact. An alert that says “checkout API p99 1.2s, SLO 200ms, 4% of users affected” is actionable in seconds. Triage time can drop by as much as half when customer impact is in the payload, because the on-call moves from “is this real?” straight to “how bad?”. Stakeholder communication also becomes trivial, because the impact line can be copied into Slack as written.

How to compute impact

The impact computation depends on the alert class. For latency alerts, include the affected endpoint, the p99 value, the SLO, and the percentage of requests over budget (sourced from Prometheus or Datadog APM). For error alerts, include the error rate, the baseline error rate, and the customer-facing surface (“5% of all logins failing”). For capacity alerts, include time-to-exhaustion: “disk 85% full, 4 days at current rate” beats “disk 85% full”.
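The three alert classes above can be sketched as small formatting helpers. This is a minimal illustration, not a reference implementation: the function names, parameters, and output wording are assumptions, and the inputs (request counts, growth rate) are presumed to come from whatever metrics backend you already query.

```python
def latency_impact(endpoint: str, p99_ms: float, slo_ms: float,
                   total_requests: int, requests_over_budget: int) -> str:
    """Impact line for a latency alert (hypothetical helper)."""
    # Guard against an empty window so the alert still renders.
    pct_over = 100.0 * requests_over_budget / total_requests if total_requests else 0.0
    return (f"{endpoint} p99 {p99_ms:.0f}ms, SLO {slo_ms:.0f}ms, "
            f"{pct_over:.1f}% of requests over budget")


def error_impact(surface: str, failed: int, total: int, baseline_rate: float) -> str:
    """Impact line for an error alert; baseline_rate is a fraction (0.002 = 0.2%)."""
    rate = failed / total if total else 0.0
    return (f"{100 * rate:.1f}% of {surface} failing "
            f"(baseline {100 * baseline_rate:.1f}%)")


def capacity_impact(resource: str, used: float, capacity: float,
                    growth_per_day: float) -> str:
    """Impact line for a capacity alert, including time-to-exhaustion.

    used, capacity, and growth_per_day share one unit (for example GB).
    Assumes linear growth at the current rate, which is a simplification.
    """
    pct_full = 100.0 * used / capacity
    if growth_per_day <= 0:
        return f"{resource} {pct_full:.0f}% full, not growing at current rate"
    days_left = (capacity - used) / growth_per_day
    return f"{resource} {pct_full:.0f}% full, {days_left:.0f} days at current rate"


if __name__ == "__main__":
    print(latency_impact("checkout API", 1200, 200, 10_000, 400))
    print(error_impact("logins", 50, 1_000, 0.002))
    print(capacity_impact("disk", 85, 100, 3.75))
```

The capacity helper is the one that adds the most information: dividing the remaining headroom by the daily growth turns a bare percentage into a deadline.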

Templating the alert

Templating makes the discipline automatic. In Prometheus, put the impact text in alerting-rule annotations, which are templated and can interpolate {{ $value }} and {{ $labels.<name> }}. Datadog monitor messages support similar templating with {{value}} and tag variables. Standardise the impact line format across the org: “Service: X, Impact: Y, Symptom: Z” is enough, and variation costs more than it saves. Render the impact line in the same place every time so the eye lands on customer impact in under 2 seconds.
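One way to enforce a single format is to route every alert message through one renderer. The sketch below assumes the “Service: X, Impact: Y, Symptom: Z” format from the text; the ImpactLine class and render_alert function are hypothetical names, not part of Prometheus or Datadog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactLine:
    """The three fields of the standard impact line (hypothetical type)."""
    service: str
    impact: str
    symptom: str

    def render(self) -> str:
        # Refuse to render a partial line: an empty field defeats the format.
        for name in ("service", "impact", "symptom"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        return (f"Service: {self.service}, Impact: {self.impact}, "
                f"Symptom: {self.symptom}")


def render_alert(line: ImpactLine, details: str = "") -> str:
    """Put the impact line first so it always appears in the same place."""
    header = line.render()
    return f"{header}\n\n{details}" if details else header


if __name__ == "__main__":
    line = ImpactLine(
        service="checkout API",
        impact="4% of users affected",
        symptom="p99 1.2s against a 200ms SLO",
    )
    print(render_alert(line, details="Runbook and dashboards go below the impact line."))
```

Keeping the impact line as the first line of the message, with everything else below it, is what makes its position predictable across alerts.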

When to skip impact

Two narrow exceptions allow skipping the impact line: internal-only alerts with no customer surface (background batch jobs, log retention, certificate renewals), and synthetic alerts that fire before customer impact is measurable, where the synthetic check itself is the signal. Even then, a synthetic alert should name the user journey it protects: “synthetic checkout flow failed” is fine, “check failed” is not.

Apply this week

Three steps are concrete enough to start this week. Audit your top 10 paging alerts and add explicit customer-impact text to each. Add a code review check so that any new alert without an impact field gets bounced. Then measure MTTA over the next month; explicit impact typically cuts triage time by 30-50 seconds per page, so the effect should be visible there.
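The review check can be a small lint step in CI. The sketch below assumes alert rules have already been loaded into dictionaries shaped like Prometheus alerting rules (with alert, labels, and annotations keys). The annotation key impact and the label internal_only are hypothetical conventions chosen for this example, so adapt them to whatever your rules actually use.

```python
from typing import Iterable, Mapping


def alerts_missing_impact(rules: Iterable[Mapping],
                          impact_key: str = "impact") -> list[str]:
    """Return the names of alerts that lack a non-empty impact annotation.

    Alerts labelled internal_only="true" (a hypothetical label) are exempt,
    matching the internal-only exception described earlier.
    """
    offenders = []
    for rule in rules:
        labels = rule.get("labels") or {}
        annotations = rule.get("annotations") or {}
        if labels.get("internal_only") == "true":
            continue
        if not str(annotations.get(impact_key, "")).strip():
            offenders.append(rule.get("alert", "<unnamed>"))
    return offenders


if __name__ == "__main__":
    sample_rules = [
        {"alert": "CheckoutLatencyHigh",
         "annotations": {"impact": "checkout API p99 over SLO, 4% of users affected"}},
        {"alert": "ApiGatewayHighCpu", "annotations": {"summary": "high CPU"}},
        {"alert": "LogRetentionJobFailed", "labels": {"internal_only": "true"}},
    ]
    missing = alerts_missing_impact(sample_rules)
    if missing:
        # A non-zero exit code is what makes a CI step fail the change.
        raise SystemExit("Alerts without an impact annotation: " + ", ".join(missing))
```

Run against the sample rules, the check flags only ApiGatewayHighCpu: the checkout alert has an impact annotation, and the log retention alert is exempt as internal-only.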