Alerts With Sample Data Included
An alert without context is harder to act on. Include sample data in the alert itself.
The idea
Alerts that include a sample of the bad data save 5 minutes of triage every time. "Error rate 12%, sample errors: connection refused (x40), read timeout (x12)" beats a bare "Error rate 12%".
Sample data answers the second question the on-call always asks: "what is the actual error?" Not having to dig for that answer matters at 3am.
Most alerting tools support this. Datadog log monitors can include samples of matching logs; Prometheus Alertmanager supports templated notification fields.
How to include samples
For log-based alerts, include the top 3 error message strings (deduplicated). Datadog and Splunk both expose this.
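A minimal sketch of that dedup-and-rank step in plain Python; the function and message strings are illustrative, not Datadog's or Splunk's API:

```python
# Minimal sketch, not any vendor's API: deduplicate recent error
# messages and keep the three most frequent for the alert body.
from collections import Counter

def top_error_samples(messages: list[str], n: int = 3) -> list[str]:
    """Return the n most frequent distinct messages, with counts."""
    return [f"{msg} (x{count})" for msg, count in Counter(messages).most_common(n)]

errors = [
    "connection refused: db-primary:5432",
    "read timeout after 30s",
    "connection refused: db-primary:5432",
]
print(top_error_samples(errors))
# ['connection refused: db-primary:5432 (x2)', 'read timeout after 30s (x1)']
```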
For trace-based alerts, include 1 trace ID. Link directly to the trace view; the on-call clicks once.
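A sketch of the link construction; the trace-view URL pattern here is an assumption, so substitute whatever deep-link format your tracing UI exposes:

```python
# Illustrative only: the trace-view URL pattern is a placeholder,
# not a real vendor endpoint.
TRACE_VIEW_URL = "https://traces.example.com/trace/{trace_id}"

def trace_link(trace_id: str) -> str:
    """Build the one-click deep link to include in the alert."""
    return TRACE_VIEW_URL.format(trace_id=trace_id)

print(trace_link("4bf92f3577b34da6a3ce929d0e0e4736"))
```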
For metric-based alerts, include the top affected dimensions. "customer-id, region" tells the on-call who is affected.
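A sketch of ranking the affected dimensions, assuming error events arrive as dicts carrying customer_id and region fields:

```python
# Sketch: rank the dimension combinations behind the most errors.
# The event shape (dicts with customer_id/region) is an assumption.
from collections import Counter

def top_dimensions(events, keys, n=3):
    """Return the n dimension-value tuples with the most errors."""
    return Counter(tuple(e[k] for k in keys) for e in events).most_common(n)

events = [
    {"customer_id": "c-214", "region": "eu-west-1"},
    {"customer_id": "c-214", "region": "eu-west-1"},
    {"customer_id": "c-980", "region": "us-east-1"},
]
print(top_dimensions(events, ("customer_id", "region")))
# [(('c-214', 'eu-west-1'), 2), (('c-980', 'us-east-1'), 1)]
```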
How much data
Three samples are usually enough. More than five is noise; the page becomes unreadable.
Truncate long messages to 200 characters. Stack traces should be linked, not pasted.
Always link to the full data source. The sample is bait; the full detail is one click away.
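All three rules in one minimal sketch; the constants and the source URL are illustrative placeholders, not any alerting tool's API:

```python
# Sketch of the formatting rules above: cap at three samples,
# truncate each to 200 characters, link out for the full data.
MAX_SAMPLES = 3   # three samples is usually enough
MAX_LEN = 200     # truncate long messages

def format_samples(samples: list[str], source_url: str) -> str:
    """Render the sample block for an alert body, ending with a link."""
    lines = [
        s if len(s) <= MAX_LEN else s[: MAX_LEN - 1] + "…"
        for s in samples[:MAX_SAMPLES]
    ]
    lines.append(f"Full data: {source_url}")
    return "\n".join(lines)
```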
Anti-patterns
Samples that don't match the alert condition. If the alert fires on 5xx but the sample is a 4xx, trust dies.
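One way to enforce that, sketched in Python: build the sample set with the same predicate the alert fires on. The 5xx check and event fields are assumptions:

```python
# Sketch: filter samples with the alert's own condition (5xx here),
# so a mismatched sample can never reach the page.
def matching_samples(events, n=3):
    """Drop anything that doesn't satisfy the alert condition."""
    return [e for e in events if 500 <= e["status"] < 600][:n]

events = [{"status": 502, "path": "/checkout"}, {"status": 404, "path": "/login"}]
print(matching_samples(events))  # only the 502; the 404 is excluded
```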
Samples without timestamps. The on-call wonders if the data is from the current minute or yesterday.
Samples that contain PII or credentials. Mask emails, tokens, and addresses; logs leak.
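A masking sketch; these regexes are deliberately simple illustrations, not an exhaustive PII filter:

```python
# Redact emails and bearer tokens before the sample leaves the
# pipeline. Patterns are illustrative; real masking needs review.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BEARER = re.compile(r"(?i)bearer\s+\S+")

def mask(text: str) -> str:
    """Replace obvious PII and credentials with placeholders."""
    return BEARER.sub("Bearer <token>", EMAIL.sub("<email>", text))

print(mask("auth failed for jane@example.com, header: Bearer eyJhbGciOi"))
# auth failed for <email>, header: Bearer <token>
```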
Apply this week
Pick your 5 most-paged alerts. Add sample data to each.
Verify the sample reaches Slack and SMS clients. Some clients truncate aggressively; test on a real phone.
Mask any field that could contain customer data before it leaves the alerting pipeline.