Alerts With Sample Data Included
An alert without context is harder to act on. Include sample data in the alert.
The idea
Alerts that include a sample of the bad data can save about 5 minutes of triage per page. “Error rate 12%, sample errors: error1, error2, error3” beats “error rate 12%”. Sample data answers the second question the on-call always asks (“what is the error?”), and skipping that lookup matters at 3am. Most alerting tools support templated samples.
- Saves 5 minutes per page. The on-call doesn’t have to query for the error; it’s in the page.
- Answers the second question. “What is the error?” is the immediate follow-up after seeing the rate.
- Tool support. Datadog log monitors can include sample matching logs; Prometheus Alertmanager supports templated notification fields.
- Per-alert sample contract. Document the sample shape in each alert template so pages stay consistent.
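As a minimal sketch of the idea, the function below builds a page body that puts the summary first, the samples second, and a link to the full data last. The `Alert` class, its field names, and the example URL are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # All field names here are illustrative assumptions.
    name: str            # alert name shown in the page header
    summary: str         # e.g. "Error rate 12%"
    samples: list[str]   # sample error messages that matched the condition
    source_url: str      # link to the full query or dashboard


def render_page(alert: Alert) -> str:
    """Build the page text: summary first, then samples, then the link."""
    lines = [f"[{alert.name}] {alert.summary}"]
    if alert.samples:
        lines.append("Sample errors:")
        lines.extend(f"  - {s}" for s in alert.samples)
    lines.append(f"Full data: {alert.source_url}")
    return "\n".join(lines)


page = render_page(
    Alert(
        name="checkout-5xx",
        summary="Error rate 12%",
        samples=["error1", "error2", "error3"],
        source_url="https://example.invalid/logs",
    )
)
print(page)
```

In a real setup the same layout would usually live in the tool's own notification template rather than in application code.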
How to include samples
The shape varies by alert type. Log-based alerts carry the top 3 deduplicated error message strings (Datadog and Splunk expose this); trace-based alerts carry 1 trace ID with a direct link to the trace view; metric-based alerts carry the top affected dimensions, such as “customer-id, region”, which tell the on-call who is affected.
- Log-based: top 3 strings. Deduplicated; Datadog and Splunk both expose this.
- Trace-based: 1 trace ID. Link directly to the trace view; on-call clicks once.
- Metric-based: top dimensions. “customer-id, region”; tells the on-call who is affected.
- Per-type template. Each alert type has a documented sample shape, which makes templates straightforward to generate.
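For the log-based case, the "top 3 deduplicated strings" can be sketched with a frequency count. This assumes the raw messages are already available as a list of strings; the function name and the "(xN)" count suffix are illustrative choices, not something a specific tool produces.

```python
from collections import Counter


def top_error_messages(messages: list[str], limit: int = 3) -> list[str]:
    """Return the most frequent distinct messages, most common first."""
    # Strip whitespace so trivially different copies collapse into one entry.
    counts = Counter(m.strip() for m in messages if m.strip())
    return [f"{msg} (x{n})" for msg, n in counts.most_common(limit)]
```

Showing the count next to each string is optional, but it tells the on-call which error dominates without another query.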
How much data
Sample size has limits. Three samples are usually enough (more than 5 is noise and the page becomes unreadable). Truncate long messages to 200 characters, with stack traces linked rather than pasted. Always link to the full data source: the sample is bait, and full detail is one click away.
- Three samples enough. More than 5 is noise; the page becomes unreadable.
- Truncate to 200 chars. Stack traces linked, not pasted; mobile clients truncate anyway.
- Always link to source. Sample is bait; full detail is one click away.
- Per-alert size limit. The sample size cap is committed to the template, which keeps pages readable.
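The limits above can be sketched as a small post-processing step. The constants mirror the numbers in the text; the function name and the choice to keep only the first line of each message (so a stack trace is never pasted) are assumptions for illustration.

```python
MAX_SAMPLES = 3   # cap from the text; more than 5 becomes noise
MAX_CHARS = 200   # per-sample length limit from the text


def shrink_samples(samples: list[str], source_url: str) -> list[str]:
    """Cap the sample count, drop stack trace lines, and truncate long messages."""
    out = []
    for raw in samples[:MAX_SAMPLES]:
        stripped = raw.strip()
        # Keep only the first line so a multi-line stack trace stays out of the page.
        first_line = stripped.splitlines()[0] if stripped else ""
        if len(first_line) > MAX_CHARS:
            first_line = first_line[: MAX_CHARS - 3] + "..."
        out.append(first_line)
    # The link to the full data is always the last entry.
    out.append(f"Full data: {source_url}")
    return out
```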
Anti-patterns
Three anti-patterns survive too long: samples that don’t match the alert condition (the alert fires on 5xx, the sample is a 4xx, and trust dies); samples without timestamps (the on-call wonders whether the data is from the current minute or from yesterday); and samples that contain PII or credentials (mask emails, tokens, and addresses, because logs leak).
- Mismatched sample. Alert fires on 5xx, sample is 4xx; trust dies.
- No timestamps. The on-call wonders whether the data is from the current minute or from yesterday.
- PII or credentials. Mask emails, tokens, addresses; logs leak.
- Per-anti-pattern lint. CI catches the common anti-patterns; the discipline lives in the linter.
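A lint for these three anti-patterns could look roughly like the sketch below. It assumes each sample is a dict with `status`, `timestamp`, and `message` keys, which is a hypothetical shape. The regular expressions cover only simple email addresses and "Bearer" tokens, so real masking rules would need to be broader.

```python
import re

# Deliberately simple patterns; they are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)


def mask(text: str) -> str:
    """Replace emails and bearer tokens with fixed placeholders."""
    text = EMAIL_RE.sub("<email>", text)
    return BEARER_RE.sub("Bearer <token>", text)


def lint_sample(sample: dict, status_prefix: str) -> list[str]:
    """Return the problems found in one sample; an empty list means it passes."""
    problems = []
    # Anti-pattern 1: the sample does not match the alert condition.
    if not str(sample.get("status", "")).startswith(status_prefix):
        problems.append("sample does not match the alert condition")
    # Anti-pattern 2: the sample has no timestamp.
    if not sample.get("timestamp"):
        problems.append("sample has no timestamp")
    # Anti-pattern 3: masking would change the message, so something leaked.
    message = sample.get("message", "")
    if mask(message) != message:
        problems.append("sample contains an unmasked email or token")
    return problems
```

Calling `lint_sample(sample, "5")` checks a sample against a 5xx alert. Running the check in CI against recorded example samples is one way to keep the discipline in the linter.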
Apply this week
The application is targeted. Pick your 5 most-paged alerts and add sample data to each; verify the sample reaches Slack and SMS clients (some clients truncate aggressively, test on a real phone); mask any field that could contain customer data before it leaves the alerting pipeline.
- Top 5 most-paged first. Highest leverage; the noisiest alerts get the biggest improvement.
- Test on real phone. Slack and SMS truncate aggressively; verify on the actual surface.
- Mask before pipeline. Customer data masked before it leaves the alerting pipeline; PII never ships.
- Per-week alert update cycle. Updating five alerts per week gives a steady migration to the new pattern.
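Finding the most-paged alerts is a simple count once the page history is exported. The sketch below assumes the history is a list of alert names, one entry per page; how that list is exported depends on the paging tool and is not shown.

```python
from collections import Counter


def most_paged(page_history: list[str], top_n: int = 5) -> list[tuple[str, int]]:
    """Count pages per alert name and return the noisiest alerts first."""
    return Counter(page_history).most_common(top_n)
```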