Incident Severity: How to Classify SEV1 / SEV2 / SEV3 Without Arguing
Most teams have a sev table on paper that nobody actually applies during an incident. The fix is a one-question test that decides the severity in under 10 seconds.
Why severity is the first decision
The severity you set in the first 60 seconds of an incident drives every other decision: who pages, how fast, who joins the bridge, how often you communicate with customers, whether anyone runs a postmortem. Get it wrong and you either over-spend on a small fire or under-respond to a real emergency. Most teams either default everything to SEV2 (so SEV1 stops feeling urgent) or default to SEV3 to avoid waking people up (so real SEV1s get under-resourced for the first half hour).
Severity is also the contract with the rest of the org. Customer success, executive comms, and in some industries legal all calibrate their response to the severity tag. Posting "investigating" with no severity is half a signal; posting "SEV1, investigating" tells them to clear their afternoon. Until severity is set, the rest of the response is improvisation.
The two-axis classification table
Two axes, four severity levels, no debate. Most teams overthink severity by inventing a five-page rubric. The two-axis frame fits on a sticky note.
- Axis 1, user impact: nobody / some / most / all paying users affected. Use rough percentages, <5%, 5-25%, 25-75%, >75%, to short-circuit the "well, technically..." conversation.
- Axis 2, function impact: cosmetic / degraded / one critical flow broken / multiple critical flows broken. "Critical flow" means anything in your top-3 user journeys: signup, login, the main job-to-be-done, checkout if you bill, etc.
Cross the two positions and read the severity off the grid. Top-right cell (most/all users × multiple critical flows broken) is SEV1. Bottom-left (nobody × cosmetic) is SEV4. The diagonals are SEV2 (one critical flow broken for most users, OR multiple flows broken for some users) and SEV3 (one flow degraded for some, OR cosmetic for many). When the two axes pull in different directions, take the higher severity.
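If you want the grid in your tooling instead of on a sticky note, here's a minimal Python sketch. The enum names are illustrative; the anchor cells come straight from the table above, and the cells the table doesn't pin down are filled using the tie-break rule (err toward the higher severity).

```python
from enum import IntEnum

class UserImpact(IntEnum):
    NOBODY = 0  # <5% of paying users
    SOME = 1    # 5-25%
    MOST = 2    # 25-75%
    ALL = 3     # >75%

class FunctionImpact(IntEnum):
    COSMETIC = 0
    DEGRADED = 1
    ONE_CRITICAL_FLOW = 2
    MULTIPLE_CRITICAL_FLOWS = 3

# SEVERITY_GRID[user][function] -> SEV level (1 is worst).
# Anchor cells follow the text; in-between cells err high.
SEVERITY_GRID = [
    # cosmetic, degraded, one flow, multiple flows
    [4, 4, 3, 3],  # nobody
    [4, 3, 2, 2],  # some
    [3, 2, 2, 1],  # most
    [3, 2, 2, 1],  # all
]

def classify(users: UserImpact, function: FunctionImpact) -> int:
    return SEVERITY_GRID[users][function]

# One critical flow broken for most users -> SEV2, per the diagonal rule.
assert classify(UserImpact.MOST, FunctionImpact.ONE_CRITICAL_FLOW) == 2
```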
The 10-second test
"If we did nothing for an hour, would the CEO walk over?" If yes, SEV1. If "the support inbox would fill up," SEV2. If "internal-only would notice," SEV3. If "nobody would notice," SEV4. The test is rude on purpose; it short-circuits an hour of debate by replacing technical assessment with social-cost assessment, which is what severity actually encodes.
Two corollaries. First, the answer changes with the time of day. An ingestion lag at 2am on a Sunday might be SEV3; the same lag at 9am Monday is SEV2 because more customers will hit it. Second, the answer changes depending on whose CEO you imagine walking over. If your largest customer is in the affected segment, severity goes up regardless of the percentage. Severity is a business decision dressed up as engineering.
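The test compresses to a few lines if you ever want it in a runbook bot. The argument names are made up for this sketch; note that the time-of-day corollary lives in the inputs (the honest answer to the CEO question differs at 2am and 9am), while the largest-customer corollary is an explicit bump.

```python
# Sketch of the 10-second test; argument names are illustrative.
def ten_second_test(ceo_would_walk_over: bool,
                    support_inbox_fills: bool,
                    only_internal_notices: bool,
                    largest_customer_affected: bool = False) -> int:
    if ceo_would_walk_over:
        sev = 1
    elif support_inbox_fills:
        sev = 2
    elif only_internal_notices:
        sev = 3
    else:
        sev = 4
    # Corollary 2: largest-customer exposure raises severity one level,
    # regardless of affected percentage. Corollary 1 (time of day) is
    # already baked into the honest answers you pass in.
    if largest_customer_affected:
        sev = max(1, sev - 1)
    return sev
```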
The four levels and what each triggers
- SEV1: total outage or data loss. Page everyone, war room within 10 minutes, customer comms within 15 minutes, executive notification, postmortem mandatory. Expect 4-6 engineers on the bridge for the duration.
- SEV2: major degradation or one critical flow broken. Page on-call + service owner, async incident channel, customer comms within 30 minutes, postmortem strongly recommended. Usually 1-2 engineers.
- SEV3: minor degradation, workaround exists. Single engineer ack within an hour, no comms unless it persists past the SLA window. Often resolved without a formal postmortem.
- SEV4: cosmetic or internal-only. Ticket, fixed in next sprint. No on-call response.
Tune the trigger thresholds for your business. A consumer app with 10M users hits SEV1 the moment 1% are affected (100k angry users). A B2B SaaS with 200 enterprise customers hits SEV1 when one of the top 10 is affected. Same severity name; different population math.
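Here's one plausible encoding of the response matrix plus the population math, as a sketch rather than a vendor schema; the dataclass fields and thresholds are stand-ins you'd tune per the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResponsePolicy:
    page: str                          # who gets paged
    comms_deadline_min: Optional[int]  # customer comms SLA, minutes
    postmortem: str

POLICIES = {
    1: ResponsePolicy("everyone; war room in 10 min", 15, "mandatory"),
    2: ResponsePolicy("on-call + service owner", 30, "strongly recommended"),
    3: ResponsePolicy("single engineer, ack within 1h", None, "optional"),
    4: ResponsePolicy("none; ticket for next sprint", None, "none"),
}

# Consumer app: SEV1 once an affected fraction is crossed.
def consumer_sev1(affected: int, total: int, sev1_fraction: float = 0.01) -> bool:
    return affected / total >= sev1_fraction

# B2B variant: SEV1 when any top-10 customer is in the blast radius.
def b2b_sev1(affected_ids: set, top10_ids: set) -> bool:
    return bool(affected_ids & top10_ids)

# 10M users, 1% threshold -> SEV1 at 100k affected users.
assert consumer_sev1(100_000, 10_000_000)
```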
Auto-escalation conditions
Some signals jump severity automatically. Bake them into the policy so the IC doesn't have to argue them in the moment.
- Time-based: any incident open longer than 90 minutes auto-promotes one level: SEV3 to SEV2, SEV2 to SEV1.
- Customer-driven: a top-10 customer escalation auto-promotes by one level. Two customer escalations within 30 minutes auto-promotes by two.
- Compliance-tagged: anything touching PII / payment data / regulated workloads starts at minimum SEV2 regardless of size.
- Repeating: the same alert firing more than 3 times in 24 hours promotes to SEV2 because it's now a pattern, not a blip.
The auto-escalation rules turn severity from a one-time judgement into a living number that reflects how the situation is actually unfolding.
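As a sketch, the four rules might look like this. The Incident fields are illustrative stand-ins for whatever your tooling actually records, and you'd run the function on a timer or on each new event rather than once.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    severity: int                  # 1 (worst) .. 4
    open_minutes: float
    top10_escalations: int         # escalations from top-10 customers
    escalations_last_30_min: int
    touches_regulated_data: bool   # PII / payments / regulated workloads
    alert_fires_last_24h: int

def promote(sev: int, levels: int = 1) -> int:
    return max(1, sev - levels)    # lower number = higher severity

def auto_escalate(i: Incident) -> int:
    sev = i.severity
    if i.open_minutes > 90 and sev in (2, 3):  # time-based: SEV3->2, SEV2->1
        sev = promote(sev)
    if i.top10_escalations >= 1:               # customer-driven
        sev = promote(sev, 2 if i.escalations_last_30_min >= 2 else 1)
    if i.touches_regulated_data:               # compliance floor: SEV2 minimum
        sev = min(sev, 2)
    if i.alert_fires_last_24h > 3:             # repeating alert -> SEV2
        sev = min(sev, 2)
    return sev
```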
Resolving severity disagreement
If two engineers disagree on severity, the higher one wins until the IC says otherwise. The cost of overshooting by one level is small: extra people show up, extra comms go out, and the team runs a full postmortem on something that could have had a lightweight one. The cost of undershooting shows up in the postmortem itself: "we should have called this SEV1 sooner" is one of the most common findings in post-incident reviews.
Common antipatterns. The "let's wait and see" downgrade: engineers afraid to escalate stall on a SEV1 for the first 20 minutes hoping it'll resolve. Don't. The "this isn't really our problem" downgrade: the assigned engineer thinks the incident belongs to a different team and refuses to declare. Declare anyway; reassign within the bridge. The "I don't want to wake the VP" downgrade: that's not your call, that's what the policy says.
What to do this week
Three concrete moves. (1) Audit your last 20 incidents. For each, write down what severity it was tagged at and what severity it should have been in hindsight (a scoring sketch follows below). If more than 30% are wrong, your team's severity intuition is miscalibrated; schedule a one-hour calibration session next sprint. (2) Put the two-axis table on a Slack canvas pinned to the on-call channel. The visible reference does more than any policy doc. (3) Add the auto-escalation rules to your incident tooling. PagerDuty, Incident.io, Rootly, and FireHydrant all support time-based and customer-tagged auto-escalation; configure it once, free yourself forever.
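For move (1), the audit is a few lines once the two columns exist; the record format here is an assumption, so pull the real rows from your incident tracker.

```python
# Tagged vs. hindsight severity for the last 20 incidents.
incidents = [
    {"id": "INC-101", "tagged": 3, "hindsight": 2},
    {"id": "INC-102", "tagged": 2, "hindsight": 2},
    # ...the other 18 rows from your tracker
]

wrong = sum(1 for i in incidents if i["tagged"] != i["hindsight"])
rate = wrong / len(incidents)
print(f"miscalibrated: {wrong}/{len(incidents)} ({rate:.0%})")
if rate > 0.30:
    print("Schedule the one-hour calibration session.")
```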