Alert Summary vs Detail
Alerts should summarise; detail is one click away.
The pattern
An alert payload should fit on a phone screen: service name, customer impact, severity, runbook link. Detail (full stack traces, dashboards, raw metric values) belongs one click away, not in the page itself. The on-call reads the summary at 3am, decides to engage or not, and opens details once at a laptop.
- Phone-screen budget. Service, customer impact, severity, runbook link; that’s it.
- Detail one click away. Stack traces, dashboards, and raw metric values stay out of the page itself.
- Read at 3am, decide to engage. The summary is the engage-or-not decision surface.
- Detail at the laptop. The on-call opens the linked dashboard once they’re ready to investigate.
What a good summary looks like
A good summary is structured, for example: “checkout-api: p99 latency 1.2s, SLO 200ms, 4% error rate spike. Started 14:32 UTC. Runbook: <link>”. It has five clauses (service, what, by how much, when, where to look), at most one line each. Leave out emoji, decorative text, and apologies: the on-call needs information, not tone.
- Five-clause structure. Service, what, by how much, when, where to look; one line each.
- Concrete example. “checkout-api: p99 latency 1.2s, SLO 200ms, 4% error rate spike. Started 14:32 UTC. Runbook: <link>”.
- No emoji or decoration. Information, not tone; the on-call needs to act, not be entertained.
- Per-template enforcement. Commit the five-clause shape to the alert template so every alert stays consistent.
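One way to commit the five-clause shape to a template is a small renderer like the sketch below. The `AlertSummary` type, its field names, the date, and the runbook URL are all hypothetical and chosen for illustration; they are not taken from any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AlertSummary:
    """Hypothetical five-clause summary; field names are illustrative."""
    service: str        # which service is affected
    what: str           # the symptom, e.g. "p99 latency 1.2s"
    how_much: str       # magnitude against the target, e.g. "SLO 200ms"
    started: datetime   # when the condition began (timezone-aware)
    runbook_url: str    # where to look


def render_summary(s: AlertSummary) -> str:
    """Render the five clauses as compact plain text, with no decoration."""
    started_utc = s.started.astimezone(timezone.utc).strftime("%H:%M UTC")
    return (
        f"{s.service}: {s.what}, {s.how_much}. "
        f"Started {started_utc}. Runbook: {s.runbook_url}"
    )


example = AlertSummary(
    service="checkout-api",
    what="p99 latency 1.2s",
    how_much="SLO 200ms, 4% error rate spike",
    started=datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc),  # made-up date
    runbook_url="https://runbooks.example.com/checkout-api-latency",  # placeholder
)
print(render_summary(example))
```

Because every alert goes through the same renderer, a field that does not map to one of the five clauses has nowhere to go, which is the point of enforcing the shape in the template.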
What good detail looks like
Detail is delivered as pre-rendered links. Link a dashboard with the relevant time range pre-selected (Datadog and Grafana support this via URL parameters). Include recent deploys (Argo CD events, GitHub Actions runs) so the on-call knows whether a change preceded the alert. Add top affected endpoints, top customers, and current load, all derivable from APM data and pre-rendered into the link.
- Linked dashboard with time range. Datadog and Grafana support URL parameters for time range and filters.
- Recent deploys included. Argo CD events, GitHub Actions runs; the on-call needs to know whether a change preceded the alert.
- Top-N derivable. Top endpoints, top customers, current load; pre-render from APM into the link.
- Per-link freshness. Generate the link parameters from the alert time so the dashboard opens to the right context.
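As a minimal sketch of a pre-rendered link, the function below builds a Grafana dashboard URL whose time window is derived from the alert time. It relies on Grafana's `from` and `to` query parameters (epoch milliseconds) and the `var-<name>` convention for dashboard variables. The host, dashboard UID, slug, the window sizes, and the assumption that the dashboard defines a variable named `service` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def grafana_deep_link(base_url, dashboard_uid, slug, alert_time, service,
                      before=timedelta(minutes=30),
                      after=timedelta(minutes=15)):
    """Build a dashboard URL with the time range centred around the alert.

    Assumes the dashboard has a template variable called `service`.
    """
    from_ms = int((alert_time - before).timestamp() * 1000)
    to_ms = int((alert_time + after).timestamp() * 1000)
    query = urlencode({"from": from_ms, "to": to_ms, "var-service": service})
    return f"{base_url}/d/{dashboard_uid}/{slug}?{query}"


alert_time = datetime(2024, 1, 15, 14, 32, tzinfo=timezone.utc)  # made-up time
link = grafana_deep_link(
    "https://grafana.example.com",  # placeholder host
    "abc123",                       # placeholder dashboard UID
    "checkout-api",
    alert_time,
    service="checkout-api",
)
print(link)
```

The window starts before the alert fired so the on-call can see the lead-up, not only the breach itself; the 30 and 15 minute defaults are arbitrary and worth tuning per alert.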
Anti-patterns
Three anti-patterns are especially persistent: alerts that paste 200 lines of stack trace into the payload (mobile clients truncate, hiding the actual error); alerts that say “see Datadog” without a deep link (forcing 5 manual steps at 3am); and alerts with 12 fields all at the same priority (the eye doesn’t know where to land).
- Stack trace in payload. 200 lines pasted; mobile clients truncate; actual error is hidden.
- “See Datadog” without deep link. Forces 5 manual steps at 3am.
- 12 equal-priority fields. Eye doesn’t know where to land; the summary fails.
- Per-anti-pattern lint. Lint alert templates in CI for each common anti-pattern, so the discipline lives in the linter.
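A lint for the three anti-patterns above could look roughly like the sketch below. The thresholds, the stack-frame heuristics, and the `lint_alert` function are assumptions made for illustration; a real check would need to match the template format of whichever alerting tool is in use.

```python
import re

MAX_SUMMARY_LINES = 8   # assumed phone-screen budget
MAX_FIELDS = 6          # assumed cap before fields stop being scannable

# Heuristic: lines that look like Python or Java stack frames.
TRACE_LINE = re.compile(
    r'^\s*(Traceback \(most recent call last\)|at [\w.$]+\(|File ".+", line \d+)'
)
TOOL_MENTION = re.compile(r"\bsee (datadog|grafana)\b", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def lint_alert(summary: str, fields: dict) -> list:
    """Return a list of human-readable problems; empty means the alert passes."""
    problems = []
    lines = summary.splitlines()

    if len(lines) > MAX_SUMMARY_LINES:
        problems.append(f"summary is {len(lines)} lines, budget is {MAX_SUMMARY_LINES}")
    if any(TRACE_LINE.match(line) for line in lines):
        problems.append("stack trace in payload; link to it instead")
    if TOOL_MENTION.search(summary) and not URL.search(summary):
        problems.append("tool mentioned without a deep link")
    if len(fields) > MAX_FIELDS:
        problems.append(f"{len(fields)} fields at equal priority, cap is {MAX_FIELDS}")
    return problems


bad_summary = "Error, see Datadog\n  at com.example.Foo.bar(Foo.java:12)"
bad_fields = {f"field_{i}": "value" for i in range(12)}
for problem in lint_alert(bad_summary, bad_fields):
    print(problem)
```

Run in CI against every alert template, a non-empty result would fail the build, so the anti-patterns are caught before an alert ever pages anyone.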
Apply this week
The application is concrete. Pick your 3 most-paged alerts and rewrite each summary to fit one phone screen; move detail to a linked dashboard with pre-set time range and filters; test on a phone (not a monitor) because the page is read on a phone first, every time.
- Pick top 3 most-paged. Rewrite summaries to fit one phone screen; the highest-leverage change.
- Move detail to dashboard. Pre-set time range and filters; the link does the work.
- Test on phone. The page is read on a phone first; the laptop view is secondary.
- Per-week three-alert cycle. Repeat with three more alerts each week for a steady migration to the new pattern.
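Picking the most-paged alerts can be as simple as counting names in an export of recent pages. The sketch below assumes a hypothetical list of alert names, one entry per page sent; how that list is exported depends on the paging tool and is not shown here.

```python
from collections import Counter


def most_paged(page_log, n=3):
    """Return the n alert names that paged most often, with their counts."""
    return Counter(page_log).most_common(n)


# Hypothetical export: one alert name per page sent during the period.
page_log = (
    ["checkout-api-latency"] * 4
    + ["db-replica-lag"] * 3
    + ["queue-depth"] * 2
    + ["cert-expiry"]
)
for name, count in most_paged(page_log):
    print(f"{name}: {count} pages")
```

Rerunning the same count each week gives the next three candidates for the rewrite cycle.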