Alert Severity Matrix Cheat Sheet
A four-row table that ends the "is this a Sev-2 or a Sev-3?" debate. Paste it into your runbook and stop arguing during pages.
Sev-1, critical, customer-facing
The product is down or unusable for a meaningful chunk of paying customers. Money is leaving the building per minute. Wake people up.
- Trigger: complete outage, data loss, security breach, payment failure, >25% error rate on critical path
- Response time: 5 minutes (page acknowledged)
- Channel: PagerDuty page (loud), dedicated #inc-sev1-<ID> channel, video bridge
- Escalation: primary on-call → secondary at +10m → engineering manager at +20m → VP at +30m
- Comms cadence: status page within 15m, customer email at 30m, internal update every 30m
- Roles required: Incident Commander, Comms Lead, Tech Lead (three separate people)
- Exit criteria: error rate < baseline for 30m, status page resolved, customer email sent, post-mortem scheduled within 48h
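The Sev-1 escalation ladder above can be sketched as a small lookup. This is a minimal illustration, assuming the +10m/+20m/+30m offsets count minutes since the first page; the function name and role strings are hypothetical labels, not PagerDuty identifiers.

```python
from typing import List, Tuple

# Sev-1 ladder from the list above: (minutes since first page, who gets paged).
# Role strings are illustrative labels only (assumption).
SEV1_LADDER: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (20, "engineering manager"),
    (30, "VP"),
]


def paged_so_far(minutes_since_page: float,
                 ladder: List[Tuple[int, str]] = SEV1_LADDER) -> List[str]:
    """Return everyone who should have been paged by now, in ladder order."""
    return [role for offset, role in ladder if minutes_since_page >= offset]


if __name__ == "__main__":
    print(paged_so_far(12))
```

Passing a different ladder (for example the Sev-2 offsets of +20m and +60m) reuses the same function without changes.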
Sev-2, major degradation
Significant customer impact, but the product still mostly works. Page during business hours; off-hours, route to the on-call.
- Trigger: 5-25% error rate, p95 latency >3× baseline, single region down with failover working, key feature broken
- Response time: 15 minutes
- Channel: PagerDuty page, #inc-sev2-<ID> channel, optional video bridge
- Escalation: primary on-call → secondary at +20m → manager at +60m
- Comms cadence: status page within 30m if customer-visible, internal update hourly
- Roles required: Incident Commander and Tech Lead (can be the same person if needed)
- Exit criteria: error rate < baseline for 30m, status page updated, post-mortem scheduled within 5 business days
Sev-3, minor issue or warning
Customers may notice; product is broadly fine. Ticket, no page outside hours.
- Trigger: error rate > baseline but < 5%, single non-critical service degraded, slow burn rate, internal tooling broken
- Response time: 1 hour during business hours, next morning otherwise
- Channel: ticket in queue, ping in #oncall Slack channel
- Escalation: ticket-owner team lead at +24h if untouched
- Comms cadence: usually none; status page only if customers report
- Roles required: one engineer
- Exit criteria: ticket closed, root cause noted, no post-mortem unless pattern repeats
Sev-4, cosmetic or signal
Things that should be fixed but don't move the needle. The triage backlog.
- Trigger: dashboard typo, stale doc, near-miss alert, single failed batch job that retried successfully
- Response time: this sprint or next
- Channel: ticket in normal backlog
- Escalation: none, closed if not picked up in 30 days
- Comms cadence: none
- Roles required: whoever picks it up
- Exit criteria: ticket closed or cancelled
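The error-rate triggers in the four levels above can be sketched as a single function. This is a rough illustration only: it covers the error-rate thresholds and nothing else (data loss, security breaches, latency, and broken features still need human judgment), and the function name, the fraction-based units, and the choice to put exactly 25% in Sev-2 are assumptions read off the ">25%" and "5-25%" wording.

```python
from typing import Optional


def severity_from_error_rate(error_rate: float, baseline: float) -> Optional[int]:
    """Map a critical-path error rate (fraction, 0.0 to 1.0) to a severity number.

    Returns None when the rate is at or below baseline, i.e. nothing to declare.
    """
    if error_rate <= baseline:
        return None          # at or under normal: no incident
    if error_rate > 0.25:
        return 1             # Sev-1: >25% error rate on critical path
    if error_rate >= 0.05:
        return 2             # Sev-2: 5-25% error rate
    return 3                 # Sev-3: above baseline but under 5%


if __name__ == "__main__":
    print(severity_from_error_rate(0.12, baseline=0.01))
```

Treat the result as a floor, in line with "if in doubt, declare higher": a human can always raise the level based on impact the error rate does not capture.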
Rules of thumb
The matrix is a starting point, not a contract. Six principles keep it honest.
- If in doubt, declare higher. Downgrading a Sev-2 to a Sev-3 mid-incident is fine; upgrading a Sev-3 at hour two is painful
- Declare on impact, not cause. "DB at 80% CPU" is not a sev, "checkout broken for half of users" is
- Page on user-visible symptoms only. CPU alerts, disk alerts, queue-depth alerts are tickets unless they correlate with user impact
- One Incident Commander, always. Two ICs is no IC. The on-call who declares becomes IC by default until handed off
- Exit criteria written in advance. "Error rate < X for Y minutes", not "feels stable"
- Sev-1 demands a post-mortem within 48h. The freshness of the memory matters more than the polish of the doc
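An exit criterion of the form "error rate < X for Y minutes" is easy to check mechanically. The sketch below assumes error-rate samples arrive as (minute, rate) pairs in ascending time order; the function name, the sample format, and the rule that the data must span the full window are illustrative assumptions, not part of the matrix.

```python
from typing import Sequence, Tuple


def exit_criteria_met(
    samples: Sequence[Tuple[float, float]],
    baseline: float,
    window_minutes: float = 30.0,
) -> bool:
    """True if every sample in the trailing window is below baseline.

    samples: (minute_timestamp, error_rate) pairs in ascending time order.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    # Require data spanning the whole window, not just a quiet last few minutes.
    if latest - samples[0][0] < window_minutes:
        return False
    window = [rate for t, rate in samples if t >= latest - window_minutes]
    return all(rate < baseline for rate in window)


if __name__ == "__main__":
    quiet = [(float(t), 0.001) for t in range(0, 31)]
    print(exit_criteria_met(quiet, baseline=0.01))
```

Writing the check down before the incident removes the "feels stable" judgment call: either the window is clean or it is not.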