Alert Summary vs Detail

Alerts should summarise; detail is one click away.

The pattern

An alert payload should fit on a phone screen: service name, customer impact, severity, runbook link. Detail (full stack traces, dashboards, raw metric values) belongs one click away, not in the page itself. The on-call reads the summary at 3am, decides whether to engage, and opens the detail once at a laptop.

What a good summary looks like

A good summary is structured: “checkout-api: p99 latency 1.2s, SLO 200ms, 4% error rate spike. Started 14:32 UTC. Runbook: <link>”. It has five clauses (service, what, by how much, when, where to look), each short enough to fit on one line. Leave out emoji, decorative text, and apologies: the on-call needs information, not tone.
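The five-clause shape can be enforced by building the summary from named fields instead of free text. The following is a minimal sketch, not a prescribed implementation: the class name, the field names, and the runbook URL are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AlertSummary:
    """Five clauses: service, what, by how much, when, where to look."""
    service: str       # which service is affected
    symptom: str       # what is wrong
    magnitude: str     # by how much, relative to the target
    started: datetime  # when it began (timezone-aware)
    runbook_url: str   # where to look next

    def render(self) -> str:
        # Always report the start time in UTC so every reader sees the same clock.
        started = self.started.astimezone(timezone.utc).strftime("%H:%M UTC")
        return (
            f"{self.service}: {self.symptom}, {self.magnitude}. "
            f"Started {started}. Runbook: {self.runbook_url}"
        )


summary = AlertSummary(
    service="checkout-api",
    symptom="p99 latency 1.2s",
    magnitude="SLO 200ms, 4% error rate spike",
    started=datetime(2024, 1, 1, 14, 32, tzinfo=timezone.utc),
    runbook_url="https://runbooks.example.com/checkout-api",  # hypothetical URL
)
print(summary.render())
```

Because every clause is a required field, an alert definition that omits the runbook or the start time fails when it is constructed, not at 3am.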

What good detail looks like

Detail is a set of pre-rendered links: a dashboard with the relevant time range pre-selected (Datadog and Grafana both support this via URL parameters); recent deploys (Argo CD events, GitHub Actions runs), so the on-call knows whether a change preceded the alert; and the top affected endpoints, top customers, and current load, all derivable from APM data and pre-rendered behind the link.
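As an illustration of a pre-selected time range, the sketch below builds a Grafana dashboard link using Grafana's `from` and `to` URL parameters (epoch milliseconds) and a `var-` prefixed template variable. The host, dashboard UID, slug, and the existence of a dashboard variable named `service` are assumptions; substitute whatever your own dashboards define.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def grafana_deep_link(
    base_url: str,
    dashboard_uid: str,
    slug: str,
    alert_start: datetime,
    service: str,
    window: timedelta = timedelta(minutes=30),
) -> str:
    """Return a dashboard URL centred on the alert start time."""
    # Grafana accepts absolute times as epoch milliseconds in `from` and `to`.
    start_ms = int((alert_start - window).timestamp() * 1000)
    end_ms = int((alert_start + window).timestamp() * 1000)
    query = urlencode({
        "from": start_ms,
        "to": end_ms,
        # Assumes the dashboard defines a template variable called "service".
        "var-service": service,
    })
    return f"{base_url.rstrip('/')}/d/{dashboard_uid}/{slug}?{query}"


link = grafana_deep_link(
    base_url="https://grafana.example.com",  # hypothetical host
    dashboard_uid="abc123",                  # hypothetical dashboard UID
    slug="checkout-api-latency",             # hypothetical slug
    alert_start=datetime(2024, 1, 1, 14, 32, tzinfo=timezone.utc),
    service="checkout-api",
)
print(link)
```

Centring the window on the alert start shows the on-call both the lead-up and the aftermath in one view. The 30 minute default is an arbitrary choice for the sketch.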

Anti-patterns

Three anti-patterns survive too long: alerts that paste 200 lines of stack trace into the payload (mobile clients truncate, hiding the actual error); alerts that say “see Datadog” without a deep link (forcing 5 manual steps at 3am); and alerts with 12 fields all at the same priority (the eye doesn’t know where to land).
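These three anti-patterns are mechanical enough to catch with a small check before an alert definition ships. The sketch below is one possible approach under stated assumptions: the function name and the line and field budgets are hypothetical, and counting fields is only a rough proxy for "everything at the same priority".

```python
import re
from typing import Dict, List

MAX_BODY_LINES = 8  # assumed budget for one phone screen
MAX_FIELDS = 6      # assumed budget before fields compete for attention

URL_PATTERN = re.compile(r"https?://\S+")


def lint_alert(body: str, fields: Dict[str, str]) -> List[str]:
    """Return a list of problems found in an alert body and its fields."""
    problems = []
    # Anti-pattern 1: long pasted output, such as a stack trace.
    if len(body.splitlines()) > MAX_BODY_LINES:
        problems.append("body exceeds one phone screen")
    # Anti-pattern 2: a pointer to a tool with no link to follow.
    if not URL_PATTERN.search(body):
        problems.append("no deep link in body")
    # Anti-pattern 3: too many fields of equal weight.
    if len(fields) > MAX_FIELDS:
        problems.append("too many fields")
    return problems


bad_body = "checkout-api is unhealthy, see Datadog"
bad_fields = {f"field_{i}": "value" for i in range(12)}
print(lint_alert(bad_body, bad_fields))
```

Run against the sample above, the check reports the missing link and the field count; it does not flag the body length, since the body is a single line.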

Apply this week

The application is concrete. Pick your 3 most-paged alerts and rewrite each summary to fit one phone screen; move the detail to a linked dashboard with a pre-set time range and filters; then test on a phone (not a monitor), because the page is read on a phone first, every time.