Alert Design From Zero
Designing alerts from scratch. The three questions to answer before any alert ships.
The three questions before any alert ships
Who pages on this. If the answer is unclear, the alert is not ready. Owner team, escalation policy, and runbook URL belong in the alert metadata.
What action does the responder take in the first 5 minutes. If there is no concrete first step, the alert is informational and belongs on a dashboard, not in PagerDuty.
What customer impact does this represent. If the alert fires without users noticing, raise the threshold or downgrade to a ticket.
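A cheap way to make these questions stick is to encode the answers as structured fields on the alert and refuse to ship until all three are filled in. A minimal sketch in Python; the field names are illustrative, not tied to any particular tool:

```python
from dataclasses import dataclass

@dataclass
class AlertSpec:
    """Answers to the three pre-ship questions, kept with the alert."""
    owner_team: str         # who pages on this
    escalation_policy: str
    runbook_url: str
    first_action: str       # concrete step for the first 5 minutes
    customer_impact: str    # what users see when this fires

def ready_to_ship(spec: AlertSpec) -> list[str]:
    """Return the unanswered questions; an empty list means ready."""
    problems = []
    if not (spec.owner_team and spec.escalation_policy and spec.runbook_url):
        problems.append("no clear owner: team, escalation, and runbook required")
    if not spec.first_action:
        problems.append("no first action: belongs on a dashboard, not in PagerDuty")
    if not spec.customer_impact:
        problems.append("no customer impact: raise the threshold or downgrade to a ticket")
    return problems
```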
Symptom alerts beat cause alerts
Page on user-visible symptoms. p99 latency over 500ms, error rate above 2%, checkout success rate dropping below 99%. These map directly to SLOs.
Cause alerts (CPU at 90%, queue depth growing) generate noise. The system can absorb a hot CPU without users noticing; the symptom alert fires only when it matters.
Keep cause-level signals as dashboards or low-priority tickets. They become diagnostic context once the symptom alert pages.
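For concreteness, the symptom thresholds above look like this as PromQL expressions, held here as Python strings. The metric and label names (http_request_duration_seconds, http_requests_total, code) are assumptions; substitute your own schema:

```python
# Symptom-level expressions matching the thresholds above.
P99_LATENCY_OVER_500MS = (
    "histogram_quantile(0.99, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5"
)
ERROR_RATE_OVER_2PCT = (
    'sum(rate(http_requests_total{code=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m])) > 0.02"
)
# Cause-level signals (CPU, queue depth) stay on dashboards as
# diagnostic context; they do not page.
```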
Pick thresholds from real data
Pull 30 days of the metric. Set the threshold at the 99th percentile of normal operation, not at a round number that felt right.
Validate with a backtest. How many times would this alert have fired last month. If the answer is more than once a week, the threshold is too tight.
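Both steps fit in a few lines. A sketch, assuming you can export the metric as one sample per scrape interval; numpy is the only dependency:

```python
import numpy as np

def pick_threshold(samples: np.ndarray) -> float:
    """Threshold = 99th percentile of 30 days of normal operation."""
    return float(np.percentile(samples, 99))

def backtest_fires(samples: np.ndarray, threshold: float,
                   sustain: int = 3) -> int:
    """Count distinct excursions above the threshold that lasted at
    least `sustain` consecutive samples, so a lone spike is not a fire."""
    fires, run = 0, 0
    for above in samples > threshold:
        run = run + 1 if above else 0
        if run == sustain:       # count each excursion exactly once
            fires += 1
    return fires

# threshold = pick_threshold(last_30_days)
# backtest_fires(last_30_days, threshold) > 4 means it would have paged
# more than once a week: the threshold is too tight.
```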
Burn-rate alerts on SLOs are sharper than fixed thresholds. A 14.4x burn rate over 1 hour catches fast outages; a 6x rate over 6 hours catches slow ones.
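The arithmetic behind those multipliers, for a 30-day error budget (a sketch; burn rate is simply the observed error rate divided by the budget rate):

```python
def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """Budget-burn speed: 1.0 means spending exactly the whole error
    budget over the SLO window; 14.4 means 14.4x that fast."""
    return error_rate / (1.0 - slo)

def budget_spent(rate: float, hours: float, window_hours: float = 720.0) -> float:
    """Fraction of a 30-day (720h) budget consumed burning at `rate`."""
    return rate * hours / window_hours

# For a 99.9% SLO the budget is 0.1% errors. Burning at:
assert abs(budget_spent(14.4, 1) - 0.02) < 1e-9  # 14.4x for 1h = 2% of budget
assert abs(budget_spent(6.0, 6) - 0.05) < 1e-9   # 6x for 6h = 5% of budget
```

At 14.4x you lose 2% of the month's budget every hour, which is why it pages fast; the 6x rate needs the longer window to trade speed for fewer false pages.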
Required alert metadata
Every alert ships with: title, summary, runbook URL, owner team, dashboard link, severity, and the query that triggered it.
Bake this into the Alertmanager template or Datadog monitor template, and reject any alert in code review that lacks a runbook link.
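A lint step makes the code-review rule mechanical. A sketch, assuming the rule files parse into dicts with these (illustrative) field names:

```python
REQUIRED = ("title", "summary", "runbook_url", "owner_team",
            "dashboard", "severity", "query")

def lint_alert(alert: dict) -> list[str]:
    """Return the problems that should block review; empty means pass."""
    problems = [f"missing field: {f}" for f in REQUIRED if not alert.get(f)]
    if not str(alert.get("runbook_url", "")).startswith("https://"):
        problems.append("runbook_url is not an https URL")
    return problems

# Run over every rule file in CI and fail the build on any non-empty result.
```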
Include the last 3 deploys to the relevant service. Half of all incidents trace to a recent deploy; surfacing that saves the responder 10 minutes.
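Surfacing deploys can be as simple as a query against whatever records your releases, injected into the alert description at notification time. A sketch against a hypothetical internal deploy log; the endpoint and response fields are invented for illustration:

```python
import requests

def recent_deploys(service: str, n: int = 3) -> list[str]:
    """Last n deploys for a service, formatted for an alert description.
    The endpoint and response shape are hypothetical -- adapt them to
    whatever records your releases."""
    resp = requests.get(f"https://deploys.internal/api/{service}/deploys",
                        timeout=5)
    resp.raise_for_status()
    return [f'{d["sha"][:8]} at {d["time"]} by {d["author"]}'
            for d in resp.json()[:n]]
```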
How to ship an alert this week
Start from one SLO per critical service. Define availability and latency SLIs, set a 99.9% target, and configure multi-window burn-rate alerts.
Run the alert in shadow mode for 7 days. Log fires to a Slack channel without paging. Tune the threshold based on what fired.
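If your stack routes through Alertmanager, shadow mode is just a route to a Slack receiver instead of the pager. For stacks without that routing, the same thing by hand, assuming a standard Slack incoming webhook URL in an environment variable:

```python
import json, os, urllib.request

def shadow_fire(alert_name: str, value: float) -> None:
    """Log a would-be page to Slack instead of PagerDuty. Assumes a
    Slack incoming webhook URL in SLACK_SHADOW_WEBHOOK."""
    payload = {"text": f"shadow fire: {alert_name} = {value:.3f}"}
    req = urllib.request.Request(
        os.environ["SLACK_SHADOW_WEBHOOK"],
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```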
Promote to paging only after the shadow week is clean. Add a quarterly review to retire alerts that have not fired in 90 days.
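The quarterly review can be scripted against Prometheus's built-in ALERTS series, assuming your retention covers 90 days; the server address here is a placeholder:

```python
import requests

PROM = "https://prometheus.internal"  # placeholder address

def alerts_fired_last_90d() -> set[str]:
    """Names of alerts that fired at least once in 90 days. Any alert
    defined in the repo but absent from this set is a candidate for
    retirement at the quarterly review."""
    q = ('count by (alertname) '
         '(max_over_time(ALERTS{alertstate="firing"}[90d]))')
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": q},
                        timeout=30)
    resp.raise_for_status()
    return {r["metric"]["alertname"] for r in resp.json()["data"]["result"]}
```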