Incident vs Alert: Different Things
An alert is a signal. An incident is the response.
Alerts and incidents are different
An alert is a signal: a threshold was crossed, a check failed; cheap to fire, cheap to resolve. An incident is the human response: someone is engaged, status is communicated, customers may be affected. Conflating them creates two failure modes: every alert becomes a manual incident (toil) or real incidents are buried in alert noise.
- Alert is a signal. Threshold crossed, check failed; cheap to fire, cheap to resolve.
- Incident is the response. Someone engaged, status communicated, customers may be affected.
- Two failure modes from conflating. Every alert becomes a manual incident (toil), or real incidents buried in noise.
- Per-alert escalation criterion. Document, for each alert, the criterion that promotes it to an incident, so triage is consistent.
Aim for a low alert-to-incident ratio
The ratio is the signal-to-noise health metric. A healthy range is roughly one incident per 5-10 alerts. More alerts per incident than that means the alerts are noisy; fewer means alerting is too sparse and real problems are probably being missed. Track it per team and per service, because the trend is more useful than the absolute number: when incidents per alert drop, retire alerts, and when they rise, look at recent code changes.
- Healthy ratio: 1:5-10. One incident per 5-10 alerts; the operating range.
- Below: noisy alerts. Fewer incidents per alert than the healthy range; too many alerts for the actual incident rate, so the noise budget is blown.
- Above: missing problems. More incidents per alert than the healthy range; too few alerts, so real incidents are not being caught early.
- Per-team and per-service tracking. The trend matters more than the absolute number.
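As a minimal sketch of the per-service tracking described above, the following Python computes alerts per incident for each service and labels it against the 5-10 range. The names `classify`, `ratio_by_service`, and `RatioReport` are hypothetical, and treating "alerts but zero incidents" as noisy is an assumption rather than something the guidance states.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Tuple


class RatioReport(NamedTuple):
    service: str
    alerts: int
    incidents: int
    alerts_per_incident: Optional[float]  # None when there were no incidents
    verdict: str


def classify(alerts: int, incidents: int,
             low: float = 5.0, high: float = 10.0) -> Tuple[Optional[float], str]:
    """Label a service against the 'one incident per 5-10 alerts' range."""
    if incidents == 0:
        # Assumption: alerts that never lead to an incident count as noise.
        verdict = "noisy" if alerts > 0 else "quiet"
        return None, verdict
    ratio = alerts / incidents
    if ratio > high:
        return ratio, "noisy"  # too many alerts for the incident rate
    if ratio < low:
        return ratio, "possibly missing problems"  # too few alerts
    return ratio, "healthy"


def ratio_by_service(alert_services: Iterable[str],
                     incident_services: Iterable[str]) -> List[RatioReport]:
    """Each input is one service name per alert (or per incident)."""
    alerts = Counter(alert_services)
    incidents = Counter(incident_services)
    reports = []
    for service in sorted(set(alerts) | set(incidents)):
        a, i = alerts[service], incidents[service]
        ratio, verdict = classify(a, i)
        reports.append(RatioReport(service, a, i, ratio, verdict))
    return reports


if __name__ == "__main__":
    sample_alerts = ["api"] * 8 + ["db"] * 30 + ["web"] * 2
    sample_incidents = ["api", "db", "web"]
    for report in ratio_by_service(sample_alerts, sample_incidents):
        print(report)
```

Running this over the same window each week (or each sprint) gives the trend per service, which is the part worth watching.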
Automate the alert-to-incident step
Auto-create incidents from alerts, but not for every alert. PagerDuty, Incident.io, and FireHydrant create incidents from alerts based on severity, label, or count over time. The rule should encode explicit escalation criteria (for example: 3 sev1 alerts on the same service within 10 minutes open an incident with status page integration).
- Tool support. PagerDuty, Incident.io, FireHydrant; all create incidents from alerts on rules.
- Don’t auto-create per alert. Defeats the purpose; create only when escalation criteria met.
- Sample rule. 3 sev1 alerts on same service in 10 minutes; status page integration.
- Document each rule. Keep the escalation rule committed and documented, so responders can trace why an incident was auto-created.
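The sample rule above (3 sev1 alerts on the same service in 10 minutes) can be sketched as a sliding-window counter. This is an illustration only, not how PagerDuty, Incident.io, or FireHydrant implement their rules; `Alert` and `EscalationRule` are hypothetical names, and the sketch assumes alerts arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str
    fired_at: datetime


class EscalationRule:
    """Open an incident after `threshold` matching alerts within `window`."""

    def __init__(self, severity: str = "sev1", threshold: int = 3,
                 window: timedelta = timedelta(minutes=10)) -> None:
        self.severity = severity
        self.threshold = threshold
        self.window = window
        # service name -> timestamps of recent matching alerts
        self._recent = defaultdict(deque)

    def observe(self, alert: Alert) -> bool:
        """Return True when this alert should open an incident."""
        if alert.severity != self.severity:
            return False
        recent = self._recent[alert.service]
        recent.append(alert.fired_at)
        # Drop timestamps that have fallen out of the window.
        while recent and alert.fired_at - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            # Reset so the same burst does not open a second incident.
            recent.clear()
            return True
        return False


if __name__ == "__main__":
    rule = EscalationRule()
    start = datetime(2024, 1, 1, 12, 0)
    for minute in (0, 4, 8):
        alert = Alert("checkout", "sev1", start + timedelta(minutes=minute))
        print(minute, rule.observe(alert))
```

Keeping the rule as a small object with explicit parameters makes it easy to commit next to its documentation, so the thresholds that opened an incident are visible during investigation.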
What gets a postmortem
Postmortems are for incidents, not alerts. A noisy alert is a cleanup item, not a postmortem. Sev1 incidents always get a postmortem; so do sev2 incidents that exceed their time-to-resolve (TTR) targets, and customer-impacting events regardless of severity. Track incident counts over time: alert counts are noise, but incident counts are signal.
- Incidents, not alerts. Noisy alert is a cleanup item, not a postmortem.
- Sev1 always. Postmortem mandatory for the highest-stakes incidents.
- Sev2 that exceed TTR targets. Plus customer-impacting events regardless of severity.
- Incident counts as signal. Alert counts are noise; incident counts drive investment.
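The postmortem criteria above reduce to a short predicate. The sketch below is one way to write it; the `Incident` fields, the function name `needs_postmortem`, and the 4-hour sev2 target are all hypothetical placeholders, since the actual TTR targets are not specified here.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Incident:
    severity: str
    time_to_resolve: timedelta
    customer_impacting: bool


# Hypothetical target; substitute the team's real sev2 TTR target.
SEV2_TTR_TARGET = timedelta(hours=4)


def needs_postmortem(incident: Incident,
                     sev2_ttr_target: timedelta = SEV2_TTR_TARGET) -> bool:
    """Apply the three criteria: sev1, slow sev2, or customer impact."""
    if incident.severity == "sev1":
        return True  # sev1 always gets a postmortem
    if incident.customer_impacting:
        return True  # customer impact, regardless of severity
    if incident.severity == "sev2" and incident.time_to_resolve > sev2_ttr_target:
        return True  # sev2 that exceeded its TTR target
    return False


if __name__ == "__main__":
    quick_sev2 = Incident("sev2", timedelta(hours=1), customer_impacting=False)
    slow_sev2 = Incident("sev2", timedelta(hours=6), customer_impacting=False)
    print(needs_postmortem(quick_sev2), needs_postmortem(slow_sev2))
```

Note that the function takes an incident, never an alert: a noisy alert has no path into the postmortem queue by construction.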
How to introduce the distinction
Three steps introduce the distinction. Pick a tool (PagerDuty’s incident object, Incident.io’s full workflow, or a homegrown table); define the auto-creation rule (which alerts auto-create incidents, which require a human); train the rotation so “did this become an incident?” is the post-shift question, not “did you get paged?”.
- Pick a tool. PagerDuty incident object, Incident.io workflow, or homegrown table.
- Define auto-creation rule. Which alerts auto-create incidents, which require a human.
- Train the rotation. “Did this become an incident?” replaces “did you get paged?”
- Per-team adoption record. Track which teams have adopted the distinction, so the rollout can be followed team by team.