Alert Severity Matrix Cheat Sheet

A four-row table that ends the "is this a Sev-2 or a Sev-3?" debate. Paste it into your runbook and stop arguing during pages.

Sev-1, critical, customer-facing

The product is down or unusable for a meaningful chunk of paying customers. Money is leaving the building per minute. Wake people up.

Trigger: complete outage, data loss, security breach, payment failure, >25% error rate on critical path
Response time: 5 minutes (page acknowledged)
Channel: PagerDuty page (loud), dedicated #inc-sev1-<ID> channel, video bridge
Escalation: primary on-call → secondary at +10m → engineering manager at +20m → VP at +30m
Comms cadence: status page within 15m, customer email at 30m, internal update every 30m
Roles required: Incident Commander, Comms Lead, Tech Lead (three separate people)
Exit criteria: error rate back at or below baseline for 30m, status page resolved, customer email sent, post-mortem scheduled within 48h
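
The Sev-1 escalation line above can be encoded as data so a bot or script can answer "who should have been paged by now?". This is a minimal sketch, not a PagerDuty integration: the names SEV1_LADDER and paged_so_far are hypothetical, and the offsets are read as minutes since the first page, which is an assumption about how the "+10m" notation is meant. Whether escalation stops once the page is acknowledged is left out.

```python
from typing import List, Tuple

# Sev-1 ladder from the matrix: (minutes since first page, who gets paged).
# Assumption: offsets count from the initial page, not from the previous step.
SEV1_LADDER: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (20, "engineering manager"),
    (30, "VP"),
]


def paged_so_far(minutes_since_page: float,
                 ladder: List[Tuple[int, str]] = SEV1_LADDER) -> List[str]:
    """Return everyone who should have been paged by now, in ladder order."""
    return [who for offset, who in ladder if minutes_since_page >= offset]


if __name__ == "__main__":
    # Twelve minutes in: primary and secondary have been paged.
    print(paged_so_far(12))
```

Passing a different ladder list covers the Sev-2 escalation line with the same function.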

Sev-2, significant customer impact

Significant customer impact, but the product still mostly works. Page during business hours; on-call during off-hours.

Trigger: 5-25% error rate, p95 latency >3× baseline, single region down with failover working, key feature broken
Response time: 15 minutes
Channel: PagerDuty page, #inc-sev2-<ID> channel, optional video bridge
Escalation: primary on-call → secondary at +20m → manager at +60m
Comms cadence: status page within 30m if customer-visible, internal update hourly
Roles required: Incident Commander and Tech Lead (can be the same person if needed)
Exit criteria: error rate back at or below baseline for 30m, status page updated, post-mortem scheduled within 5 business days

Sev-3, minor customer impact

Customers may notice; the product is broadly fine. File a ticket; no page outside business hours.

Trigger: error rate > baseline but < 5%, single non-critical service degraded, slow burn rate, internal tooling broken
Response time: 1 hour during business hours, next morning otherwise
Channel: ticket in queue, ping in #oncall Slack channel
Escalation: ticket-owner team lead at +24h if untouched
Comms cadence: usually none; status page only if customers report
Roles required: one engineer
Exit criteria: ticket closed, root cause noted, no post-mortem unless pattern repeats

Sev-4, backlog

Things that should be fixed but don't move the needle. The triage backlog.

Trigger: dashboard typo, stale doc, near-miss alert, single failed batch job that retried successfully
Response time: this sprint or next
Channel: ticket in normal backlog
Escalation: none; ticket is closed if not picked up within 30 days
Comms cadence: none
Roles required: whoever picks it up
Exit criteria: ticket closed or cancelled
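
The error-rate triggers in the matrix are the only ones that reduce cleanly to arithmetic, so they are the easiest to automate. The sketch below covers just that one signal; the function name severity_from_error_rate is hypothetical, rates are assumed to be fractions between 0.0 and 1.0, and the baseline is assumed to sit below 5%. Non-numeric triggers (data loss, security breach, payment failure) still need a human call, and Sev-4 is never derived from error rate.

```python
from typing import Optional


def severity_from_error_rate(error_rate: float, baseline: float) -> Optional[str]:
    """Map a critical-path error rate (fraction, 0.0 to 1.0) to a severity.

    Thresholds follow the matrix: >25% is Sev-1, 5-25% is Sev-2,
    above baseline but under 5% is Sev-3. Returns None at or below baseline.
    """
    if error_rate <= baseline:
        return None          # nothing to declare
    if error_rate > 0.25:
        return "Sev-1"
    if error_rate >= 0.05:
        return "Sev-2"       # the 5% and 25% boundaries both land here
    return "Sev-3"


if __name__ == "__main__":
    for rate in (0.30, 0.25, 0.05, 0.03, 0.01):
        print(rate, severity_from_error_rate(rate, baseline=0.01))
```

Treat the result as a floor, in line with "if in doubt, declare higher": a human can raise the severity, but the script should not talk anyone down from one.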

The matrix is a starting point, not a contract. Six principles keep it honest.

If in doubt, declare higher. Downgrading a Sev-2 to a Sev-3 mid-incident is fine; upgrading a Sev-3 at hour two is painful.
Declare on impact, not cause. "DB at 80% CPU" is not a sev; "checkout broken for half of users" is.
Page on user-visible symptoms only. CPU alerts, disk alerts, and queue-depth alerts are tickets unless they correlate with user impact.
One Incident Commander, always. Two ICs is no IC. The on-call who declares becomes IC by default until handed off.
Exit criteria written in advance. "Error rate < X for Y minutes", not "feels stable".
Sev-1 demands a post-mortem within 48h. The freshness of the memory matters more than the polish of the doc.
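
An exit criterion of the form "error rate < X for Y minutes" is concrete enough to check mechanically. The following is a rough sketch under stated assumptions: exit_criteria_met is a hypothetical helper, samples are (minute timestamp, error rate) pairs sorted oldest first, and the check refuses to pass when the data does not reach back the full Y minutes.

```python
from typing import List, Tuple


def exit_criteria_met(samples: List[Tuple[float, float]],
                      threshold: float,
                      window_minutes: float) -> bool:
    """True if every sample in the trailing window is below the threshold.

    samples: (timestamp in minutes, error rate) pairs, oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window_start = latest - window_minutes
    # Without data reaching back to the window start, "for Y minutes" is unproven.
    if samples[0][0] > window_start:
        return False
    recent = [rate for t, rate in samples if t >= window_start]
    return all(rate < threshold for rate in recent)


if __name__ == "__main__":
    history = [(0, 0.20), (10, 0.01), (20, 0.01), (30, 0.01), (40, 0.01)]
    # The last 30 minutes are clean, but the last 40 include the 20% spike.
    print(exit_criteria_met(history, threshold=0.02, window_minutes=30))
    print(exit_criteria_met(history, threshold=0.02, window_minutes=40))
```

Choosing X and Y before the incident is the point; the code only removes the temptation to close on "feels stable".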