Customer Impact Explicit in Alerts
Alerts should state customer impact plainly.
Why state impact explicitly
An alert that says “high CPU on api-gateway-3” leaves the on-call to derive impact. An alert that says “checkout API p99 1.2s, SLO 200ms, 4% of users affected” is actionable in seconds. Triage time can drop by roughly half when customer impact is in the payload, because the on-call moves from “is this real?” straight to “how bad?”. Stakeholder communication becomes trivial because the impact line copies straight into Slack.
- Implicit vs explicit impact. “High CPU” vs “p99 1.2s, 4% users affected”; the second is actionable.
- Triage time halves. On-call skips the “is this real?” phase.
- Stakeholder copy-paste. Impact line goes straight into Slack; the translation step disappears.
- Per-alert impact slot. Make the slot mandatory in the template so every alert carries it in the same form.
How to compute impact
The impact computation is alert-class specific. For latency alerts include the affected endpoint, p99 value, SLO, and percentage of requests over budget (source from Prometheus or Datadog APM); for error alerts include error rate, baseline error rate, and the customer-facing surface (“5% of all logins failing”); for capacity alerts include time-to-exhaustion (“disk 85% full, 4 days at current rate” beats “disk 85% full”).
- Latency: endpoint, p99, SLO, % over budget. Source from Prometheus or Datadog APM.
- Error: rate, baseline, customer surface. “5% of logins failing” gives audience.
- Capacity: time-to-exhaustion. “4 days at current rate” beats “85% full”.
- Per-class template. Each alert class has its own impact template, which keeps alerts within a class consistent.
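The per-class computations above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, parameters, and output wording are assumptions made for this example, and the input numbers would come from whatever your metrics backend (Prometheus, Datadog APM) returns. The capacity estimate assumes linear growth at the current rate.

```python
def latency_impact(endpoint: str, p99_ms: float, slo_ms: float,
                   pct_over_budget: float) -> str:
    """Latency class: endpoint, p99, SLO, and share of requests over budget."""
    return (f"{endpoint} p99 {p99_ms:.0f}ms, SLO {slo_ms:.0f}ms, "
            f"{pct_over_budget:.1f}% of requests over budget")


def error_impact(surface: str, error_rate_pct: float, baseline_pct: float) -> str:
    """Error class: current rate, baseline rate, and the customer-facing surface."""
    return (f"{error_rate_pct:.1f}% of {surface} failing "
            f"(baseline {baseline_pct:.1f}%)")


def capacity_impact(resource: str, used_pct: float,
                    growth_pct_per_day: float) -> str:
    """Capacity class: time-to-exhaustion, assuming linear growth."""
    if growth_pct_per_day <= 0:
        # No growth means no exhaustion date; say so instead of dividing by zero.
        return f"{resource} {used_pct:.0f}% full, not growing at current rate"
    days_left = (100.0 - used_pct) / growth_pct_per_day
    return f"{resource} {used_pct:.0f}% full, {days_left:.0f} days at current rate"


if __name__ == "__main__":
    print(latency_impact("checkout API", 1200, 200, 4.0))
    print(error_impact("logins", 5.0, 0.2))
    print(capacity_impact("disk", 85, 3.75))
```

Keeping one small function per alert class makes it obvious which inputs each class needs, and a missing input fails at call time rather than producing an alert with a blank impact line.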
Templating the alert
Templating makes the discipline automatic. Use Prometheus alert templates with descriptive annotations; Datadog monitor messages support similar templating with {{value}} and tag substitution; standardise the impact line format across the org (“Service: X, Impact: Y, Symptom: Z” is enough, variation costs more than it saves); render impact in the same place every time so the eye lands on customer impact in under 2 seconds.
- Prometheus annotations. Descriptive templates with label substitution.
- Datadog templating. {{value}} and tag substitution; equivalent capability.
- Standard format. “Service: X, Impact: Y, Symptom: Z”; variation costs more than it saves.
- Same render position. Eye lands on impact in under 2 seconds.
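One way to enforce the standard format and the fixed render position is to build the alert body through a single helper. The sketch below is illustrative only: `render_impact_line` and `render_alert_body` are hypothetical names, and the output would still need to be wired into a Prometheus annotation or Datadog monitor message by your own tooling.

```python
def render_impact_line(service: str, impact: str, symptom: str) -> str:
    """Render the org-wide 'Service: X, Impact: Y, Symptom: Z' line."""
    fields = {"Service": service, "Impact": impact, "Symptom": symptom}
    # Reject blank fields so an alert can never ship with an empty impact slot.
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError(f"missing impact fields: {', '.join(missing)}")
    return ", ".join(f"{name}: {value.strip()}" for name, value in fields.items())


def render_alert_body(service: str, impact: str, symptom: str,
                      details: str = "") -> str:
    """Put the impact line first so it always renders in the same position."""
    headline = render_impact_line(service, impact, symptom)
    return headline if not details else f"{headline}\n\n{details}"


if __name__ == "__main__":
    print(render_alert_body(
        "checkout",
        "4% of users affected",
        "p99 1.2s vs SLO 200ms",
        details="Runbook and dashboards go below the impact line.",
    ))
```

Because the line is intended for humans, values that contain commas are left as they are; if you need machine parsing, a structured field per value would be a safer choice than splitting this string.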
When to skip impact
Two narrow exceptions allow skipping: internal-only alerts with no customer surface (background batch jobs, log retention, certificate renewals), and synthetic alerts that fire before customer impact is measurable (the synthetic itself is the signal). Even so, synthetic alerts should name the protected user journey (“synthetic checkout flow failed” is fine, “check failed” is not).
- Internal-only no-customer-surface. Background jobs, log retention, cert renewals.
- Synthetics with implied impact. Pre-customer-visible signals; impact is implicit in the synthetic.
- Even synthetics name the journey. “Synthetic checkout flow failed”; not “check failed”.
- Document each skip. Record the exception in the alert config so it can be reviewed later.
Apply this week
Three concrete steps: audit your top 10 paging alerts and add explicit customer-impact text to each; add a code review check so any new alert without an impact field gets bounced; and measure MTTA (mean time to acknowledge) over the next month, because explicit impact typically cuts triage time by 30-50 seconds per page.
- Top 10 paging audit. Add explicit customer-impact text to each.
- Code review check. New alerts without an impact field get bounced.
- 30-50 second MTTA cut. Per-page; the measurable improvement.
- Weekly alert update cycle. Update a few alerts each week; a steady ramp keeps the discipline going.
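The code review check can be automated with a small lint step. The sketch below assumes alert rules have already been loaded into Python dictionaries shaped like Prometheus alerting rules (an `alert` name plus an `annotations` mapping). The annotation keys `impact` and `impact_exempt_reason` are conventions invented for this example, not built-in fields of any tool; the second one models the documented exceptions for internal-only and synthetic alerts.

```python
from typing import Any, Dict, List


def find_alerts_missing_impact(rules: List[Dict[str, Any]]) -> List[str]:
    """Return names of alerts that have neither impact text nor a documented exemption."""
    failures = []
    for rule in rules:
        annotations = rule.get("annotations") or {}
        # 'impact' and 'impact_exempt_reason' are hypothetical annotation keys.
        has_impact = bool(str(annotations.get("impact", "")).strip())
        has_exemption = bool(str(annotations.get("impact_exempt_reason", "")).strip())
        if not (has_impact or has_exemption):
            failures.append(rule.get("alert", "<unnamed>"))
    return failures


if __name__ == "__main__":
    example_rules = [
        {"alert": "CheckoutLatencyHigh",
         "annotations": {"impact": "checkout API p99 over SLO, 4% of users affected"}},
        {"alert": "LogRetentionJobFailed",
         "annotations": {"impact_exempt_reason": "internal-only, no customer surface"}},
        {"alert": "HighCpuApiGateway", "annotations": {"summary": "high CPU"}},
    ]
    missing = find_alerts_missing_impact(example_rules)
    for name in missing:
        print(f"alert {name} has no impact annotation or documented exemption")
    # A non-zero exit code lets CI bounce the change.
    raise SystemExit(1 if missing else 0)
```

Run against the rule files in CI, a check like this turns the review rule into a gate, and the exemption key keeps every skip visible in the alert config for later review.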