Alert Design From Zero
Designing alerts from scratch. The five questions to answer before any alert ships.
The five questions before any alert ships
Five questions decide whether an alert is ready for paging or still belongs on a dashboard. Anything missing fails the gate.
- Who pages on this. Owner team, escalation policy, and runbook URL belong in the alert metadata. Unclear ownership is unfinished work.
- First 5-minute action. What does the responder do? No concrete first step means the alert is informational and belongs on a dashboard, not in PagerDuty.
- Customer impact. If the alert fires without users noticing, raise the threshold or downgrade to a ticket.
- Reversibility. Is the response reversible if the alert turns out to be false?
- Runbook stub. Is there at least a stub runbook a new on-call could follow at 3am?
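The gate above can be sketched as a small check that runs before an alert is allowed to page. This is a minimal illustration, not a prescribed implementation: `AlertProposal`, its field names, and the `destination` helper are all hypothetical, and treating an unanswered reversibility question (rather than an answered "no") as the failure is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertProposal:
    """Hypothetical record of the answers collected before an alert ships."""
    name: str
    owner_team: str = ""
    escalation_policy: str = ""
    runbook_url: str = ""
    first_action: str = ""       # what the responder does in the first 5 minutes
    customer_impact: str = ""    # what users notice when this fires
    response_reversible: Optional[bool] = None  # None means "nobody answered"


def gate_failures(p: AlertProposal) -> List[str]:
    """Return the unanswered questions; an empty list means the gate passes."""
    failures = []
    if not (p.owner_team and p.escalation_policy):
        failures.append("ownership")
    if not p.first_action:
        failures.append("first action")
    if not p.customer_impact:
        failures.append("customer impact")
    if p.response_reversible is None:
        failures.append("reversibility")
    if not p.runbook_url:
        failures.append("runbook stub")
    return failures


def destination(p: AlertProposal) -> str:
    """Anything missing keeps the alert on a dashboard instead of paging."""
    return "page" if not gate_failures(p) else "dashboard"
```

Running `gate_failures` in code review gives the reviewer a concrete list of what is still missing rather than a yes/no verdict.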
Symptom alerts beat cause alerts
Page on what users feel, not on what your servers feel. Cause alerts produce noise; symptom alerts fire only when it matters.
- User-visible symptoms. p99 latency over 500ms, error rate above 2 percent, checkout success rate dropping below 99 percent. These map directly to SLOs.
- Why cause alerts hurt. CPU at 90 percent or a growing queue depth describes the system, not the user. The system can often absorb hot CPU without users noticing, so a cause alert pages on load that is already being handled.
- Cause as diagnostic. Keep cause-level signals as dashboards or low-priority tickets. They become diagnostic context once the symptom alert pages.
- Tie-back link. Each symptom alert links to the cause-level dashboards a responder will need within the first minute.
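As a rough sketch of the symptom-first approach, the snippet below evaluates one window of requests against the example thresholds from the list (p99 latency over 500ms, error rate above 2 percent) and attaches cause-level dashboards only once a symptom has breached. The input shape, the function names, and the dashboard names in `CAUSE_DASHBOARDS` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Thresholds taken from the examples in the text.
P99_LATENCY_MS = 500.0
MAX_ERROR_RATE = 0.02

# Hypothetical tie-back map: which cause dashboards help with which symptom.
CAUSE_DASHBOARDS: Dict[str, List[str]] = {
    "p99_latency": ["cpu", "queue-depth"],
    "error_rate": ["dependency-health", "recent-deploys"],
}


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def symptom_breaches(requests: Sequence[Tuple[float, bool]]) -> List[str]:
    """requests holds (latency_ms, ok) pairs from one evaluation window."""
    if not requests:
        return []
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, ok in requests if not ok)
    breaches = []
    if p99(latencies) > P99_LATENCY_MS:
        breaches.append("p99_latency")
    if errors / len(requests) > MAX_ERROR_RATE:
        breaches.append("error_rate")
    return breaches


def page_payload(requests: Sequence[Tuple[float, bool]]) -> Optional[dict]:
    """Return None when users are unaffected; otherwise symptoms plus dashboards."""
    breaches = symptom_breaches(requests)
    if not breaches:
        return None
    dashboards = sorted({d for b in breaches for d in CAUSE_DASHBOARDS[b]})
    return {"symptoms": breaches, "dashboards": dashboards}
```

Note that CPU and queue depth never trigger anything here; they only appear as links in the payload after a user-visible symptom has fired.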
Pick thresholds from real data
Thresholds come from data, not intuition. Four steps anchor the threshold to reality and prevent the gut-feel problem.
- 30-day percentile. Pull 30 days of the metric. Set the threshold at the 99th percentile of normal operation, not a round number that felt right.
- Backtest validation. How many times would this alert have fired last month? More than once a week means the threshold is too tight.
- Burn-rate over fixed. Burn-rate alerts on SLOs are sharper than fixed thresholds. With a 30-day SLO window, a 14.4x burn over 1 hour catches fast outages; a 6x burn over 6 hours catches slow ones.
- Re-validate quarterly. Traffic grows, services scale, distributions shift. Last quarter’s threshold may be loose or tight today.
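The percentile and backtest steps can be sketched in a few lines. This assumes the 30 days of samples are already loaded as a list of floats with incident windows excluded (so they represent normal operation); the function names and the `for_points` parameter are hypothetical.

```python
from typing import Sequence


def percentile_threshold(samples: Sequence[float], pct: int = 99) -> float:
    """Nearest-rank percentile of the samples, used as the candidate threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def backtest_fires(samples: Sequence[float], threshold: float,
                   for_points: int = 1) -> int:
    """Count how often the alert would have fired over the sample history.

    A fire is counted when `for_points` consecutive samples exceed the
    threshold; the run must drop back below before another fire can count.
    """
    fires = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == for_points:
                fires += 1
        else:
            run = 0
    return fires


def too_tight(fires: int, days: int = 30) -> bool:
    """More than one fire per week on average means the threshold is too tight."""
    return fires * 7 > days
```

Because roughly 1 percent of samples sit above a 99th-percentile threshold by construction, the backtest is usually run with `for_points` greater than 1 so that a single noisy sample does not count as a fire.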
Required alert metadata
The metadata is what converts a fire into actionable information. Without it, the page wastes responder time even when the alert is correct.
- Standard fields. Title, summary, runbook URL, owner team, dashboard link, severity, and the query that triggered it.
- Template enforcement. Bake the fields into the Alertmanager or Datadog monitor template. Reject alerts in code review that lack a runbook link.
- Recent deploys. Include the last 3 deploys to the relevant service. Many incidents trace to a recent deploy; surfacing that in the page saves the responder the time spent hunting for it.
- Customer-impact line. One sentence the responder can repeat to leadership without rephrasing. The blast-radius framing matters.
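One way to enforce the fields is a small lint-and-render step run in CI or at page time. The sketch below is tool-agnostic and does not use the Alertmanager or Datadog template syntax; the field keys, the function names, and the assumption that `recent_deploys` is ordered oldest first are all illustrative.

```python
from typing import Dict, List, Sequence

# Keys mirror the standard fields listed above; the exact names are assumed.
REQUIRED_FIELDS = (
    "title", "summary", "runbook_url", "owner_team",
    "dashboard_url", "severity", "query",
)


def missing_fields(alert: Dict[str, str]) -> List[str]:
    """Return required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(alert.get(f) or "").strip()]


def render_page(alert: Dict[str, str], recent_deploys: Sequence[str],
                impact_line: str) -> str:
    """Build the page body, refusing to render incomplete alerts."""
    missing = missing_fields(alert)
    if missing:
        raise ValueError("alert metadata incomplete: " + ", ".join(missing))
    if not impact_line.strip():
        raise ValueError("customer-impact line is required")
    last_three = list(recent_deploys)[-3:]  # assumes oldest-first ordering
    lines = [
        f"[{alert['severity']}] {alert['title']}",
        alert["summary"],
        f"Impact: {impact_line}",
        f"Owner: {alert['owner_team']}",
        f"Runbook: {alert['runbook_url']}",
        f"Dashboard: {alert['dashboard_url']}",
        f"Query: {alert['query']}",
        "Recent deploys: " + (", ".join(last_three) or "none recorded"),
    ]
    return "\n".join(lines)
```

Raising on missing metadata, rather than rendering a partial page, is a deliberate choice in this sketch: it pushes the failure into review time instead of the responder's 3am.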
How to ship an alert this week
The discipline ships in four concrete steps. Each one is independently useful, and together they compound.
- Start from SLO. Define one SLO per critical service. Cover availability and latency, set a 99.9 percent target, and configure multi-window burn-rate alerts.
- Shadow for 7 days. Log fires to a Slack channel without paging. Tune the threshold based on what fired during the week.
- Promote on clean shadow. Move to paging only after the shadow week is clean. Set a quarterly review to retire alerts that have not fired in 90 days.
- Stamp the owner. The owner team is named at promotion time, not later. An alert that pages without an owner is a future incident waiting to happen.
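For the "Start from SLO" step, the arithmetic behind a multi-window burn-rate alert is short enough to sketch. The code assumes a 99.9 percent target over a 30-day window and takes the error ratios for the long and short windows as inputs; the function names are hypothetical, and pairing each long window with a shorter confirmation window is a common convention assumed here rather than something the steps above specify.

```python
SLO_TARGET = 0.999          # the 99.9 percent target from the step above
PERIOD_HOURS = 30 * 24      # assumed 30-day SLO window


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1 spends exactly the whole budget over the SLO window.
    """
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the total error budget spent at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


def should_page(long_error_ratio: float, short_error_ratio: float,
                factor: float, slo_target: float = SLO_TARGET) -> bool:
    """Fire only if both the long and the short window exceed the burn factor.

    The short window keeps the alert from paging after the problem has
    already stopped.
    """
    return (burn_rate(long_error_ratio, slo_target) >= factor
            and burn_rate(short_error_ratio, slo_target) >= factor)
```

As a usage example, `budget_consumed(14.4, 1)` comes out at about 0.02, which is why a 14.4x burn sustained for one hour is usually read as spending about 2 percent of the monthly budget. During the shadow week, `should_page` results would be logged to the Slack channel instead of routed to the pager.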