Incident Severity Decision Tree
"Is this a Sev-1?" should never be a meeting. Three checks (impact, surface, reversibility) resolve it in under 30 seconds. Print it, post it, end the debates.
The 30-second tree
Ask three yes/no questions in this exact order. A yes on impact (Q1) or on irreversibility (Q2) makes it a Sev-1; otherwise surface (Q3) decides between Sev-2 and Sev-3.
- Q1. Is revenue, customer trust, or safety at stake right now? → Yes = Sev-1.
- Q2. Is the damage permanent or growing without bound? → Yes = Sev-1.
- Q3. Are more than 50% of users, or a critical user journey, degraded? → Yes = Sev-2.
- None of the above? → Sev-3. Fix in business hours.
- Internal-only impact (employee tools broken, no customer effect)? → Sev-4 regardless of size, unless it's blocking incident response itself.
- Default to the higher severity when unsure. Downgrading later is cheap; upgrading later loses 30 minutes of response time.
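The tree is small enough to write down as a function. The sketch below is one possible Python encoding, not a definitive implementation. The names `Triage`, `yes`, and `classify` are hypothetical. An unsure answer (`None`) counts as a yes, following the default-to-higher rule. Two placements are assumptions the tree does not spell out: the internal-only check runs after Q1 and Q2 so that safety and irreversibility still win, and an internal-only incident that blocks incident response is simply sent on to Q3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triage:
    """Answers to the tree's questions. None means "unsure"."""
    impact_now: Optional[bool]            # Q1: revenue, trust, or safety at stake now
    permanent_or_growing: Optional[bool]  # Q2: damage permanent or growing without bound
    broad_or_critical: Optional[bool]     # Q3: > 50% of users or a critical journey degraded
    internal_only: bool = False
    blocks_incident_response: bool = False


def yes(answer: Optional[bool]) -> bool:
    """Treat "unsure" (None) as yes: default to the higher severity."""
    return answer is not False


def classify(t: Triage) -> str:
    if yes(t.impact_now):
        return "Sev-1"
    if yes(t.permanent_or_growing):
        return "Sev-1"
    # Assumption: internal-only is checked here, after the Sev-1 triggers.
    if t.internal_only and not t.blocks_incident_response:
        return "Sev-4"
    if yes(t.broad_or_critical):
        return "Sev-2"
    return "Sev-3"
```

For example, `classify(Triage(False, None, False))` returns "Sev-1": nobody could rule out growing damage, so the higher severity applies until someone can.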
Check 1, Impact
Impact has three flavors: revenue, trust, safety. Any of them at scale is a Sev-1.
- Revenue. Checkout broken, payments rejected, signups failing. The test: the hard-dollar number ticks down every minute.
- Trust. PII leak, wrong data shown to wrong customer, security event. Reputation damage outlasts the outage by months.
- Safety. Healthcare, automotive, anything where a customer can be physically harmed. Always Sev-1, no surface threshold.
- "Slowness" alone is not Sev-1 unless it's slow enough to break (timeouts, retries hammering downstream). Latency that's annoying = Sev-2.
- Customer-facing error rate > 5% sustained > 5 minutes = Sev-1 in most companies. Calibrate the number to your business.
- If the CEO would call you about it, it's a Sev-1. Useful gut check.
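The "error rate > 5% sustained > 5 minutes" rule can be checked mechanically. The following is a minimal sketch, assuming error-rate samples arrive as `(unix_seconds, error_rate)` pairs sorted by time; the function name, the sample format, and the default values are illustrative and should be calibrated to your business.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 0.05,       # 5% customer-facing error rate
    min_duration_s: float = 300.0,  # 5 minutes
) -> bool:
    """Return True if the rate stayed above threshold for longer than min_duration_s."""
    breach_start = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False
```

A single dip below the threshold resets the timer here. That is a deliberately strict reading of "sustained"; a rolling average would be a reasonable alternative.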
Check 2, Surface
Surface is how far the impact reaches. With impact established, it decides between Sev-1 and Sev-2; without it, surface is what separates Sev-2 from Sev-3.
- How many users? > 50% = Sev-1 territory regardless of feature. < 5% = often Sev-3 unless the feature is critical.
- How many tenants? In multi-tenant systems, "10 tenants down" can be Sev-1 even if those 10 are 1% of users: if they're whales, the impact is asymmetric.
- Which user journey? "Login broken" trumps "settings page broken" even at the same percentage. Critical paths get severity bumps.
- Geographic surface: "all of EU" is Sev-1 even if EU is < 50% of traffic. Regional outages have regulatory implications.
- The default-error mode of every feature should be documented somewhere. "Search degraded falls back to recommendations" is a Sev-3 design; "checkout has no fallback" is a Sev-1 design.
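The surface rules can be sketched as one function, for use once impact is established. This is an illustration under stated assumptions, not a definitive policy: the name `surface_severity` and its parameters are hypothetical, the unspecified band between 5% and 50% of users is treated as Sev-2, a critical journey bumps a would-be Sev-3 to Sev-2, and any whale tenant down is treated as Sev-1 (the text only says it "can be").

```python
def surface_severity(
    user_fraction: float,
    *,
    critical_journey: bool = False,
    whale_tenants_down: int = 0,
    whole_region_down: bool = False,
) -> str:
    """Suggest a severity from surface alone; user_fraction is in [0, 1]."""
    # Majority of users, a whole region, or whale tenants: Sev-1 territory.
    if user_fraction > 0.50 or whole_region_down or whale_tenants_down > 0:
        return "Sev-1"
    # Small surface on a non-critical feature: often Sev-3.
    if user_fraction < 0.05 and not critical_journey:
        return "Sev-3"
    # Assumption: everything in between, or a small critical-path failure.
    return "Sev-2"
```

With these assumptions, `surface_severity(0.01, whale_tenants_down=10)` returns "Sev-1" even though only 1% of users are affected.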
Check 3, Reversibility
The most-missed dimension. A small problem getting worse every minute, or one you can't undo, is a Sev-1 even if today's surface is small.
- Data loss in progress: corruption writing to the DB, a queue draining unprocessed messages. Sev-1 from minute one. The clock matters more than the size.
- Cascading failure: one service is down, retries are ramping, and downstream services are starting to time out. Sev-1, even before the second service falls.
- Irreversible actions: emails sent, refunds processed, customer messages delivered. Once it's out, you can't pull it back. Sev-1 if it's the wrong content.
- Reputational decay: a customer-visible page covered in errors, social-media-grade. The longer it lasts, the deeper the trust damage. Sev-1 after 5 minutes.
- Counter-test: if you go back to bed, will the situation be worse in two hours? Yes = Sev-1.
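The reversibility signals above can be kept as an explicit checklist so the reason for a Sev-1 call is recorded along with the call. A minimal sketch, assuming the responder fills in the flags by hand; `ReversibilitySignals` and `sev1_reasons` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReversibilitySignals:
    data_loss_in_progress: bool = False
    cascading_failure: bool = False
    wrong_content_sent: bool = False     # irreversible action with the wrong content
    visible_error_minutes: float = 0.0   # how long customers have seen the errors
    worse_in_two_hours: bool = False     # the go-back-to-bed counter-test


def sev1_reasons(s: ReversibilitySignals) -> List[str]:
    """Return every reversibility reason that makes this a Sev-1 (empty if none)."""
    reasons = []
    if s.data_loss_in_progress:
        reasons.append("data loss in progress")
    if s.cascading_failure:
        reasons.append("cascading failure")
    if s.wrong_content_sent:
        reasons.append("irreversible wrong content")
    if s.visible_error_minutes >= 5:
        reasons.append("reputational decay")
    if s.worse_in_two_hours:
        reasons.append("worse in two hours")
    return reasons
```

A non-empty list means Sev-1 regardless of today's surface; the list itself is a ready-made first line for the incident channel.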
The Sev-1 vs Sev-2 table
- Sev-1. Response time < 5 min. Channels: page primary + secondary. Comms: status page within 15 min. Eng leadership notified. War room. Postmortem required.
- Sev-2. Response time < 15 min. Channels: page primary. Comms: internal channel; status page only if customer-visible. Postmortem required.
- Sev-3. Response time business hours. Channels: ticket / async. Comms: none required. Postmortem optional.
- Sev-4. Internal-only / minor. Track in the backlog. No paging.
- Cadence of updates: Sev-1 every 30 min, Sev-2 every hour. Even just "no change" is the right update.
- Who can downgrade severity: only the incident commander. This avoids "let me just call this a Sev-2 so we don't have to do a postmortem".
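The response rules above fit in a small lookup table that paging or chat tooling could read. The sketch below is one way to lay it out, with hypothetical names (`Policy`, `POLICIES`, `may_change_severity`, the `incident_commander` role string). `None` marks values the list leaves unstated, and letting anyone upgrade severity is an assumption; the text only restricts downgrades.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    respond_within_min: Optional[int]  # None = business hours or backlog
    page: Tuple[str, ...]              # who gets paged
    update_every_min: Optional[int]    # None = cadence not specified
    postmortem: Optional[str]          # "required", "optional", or None if unstated


POLICIES = {
    "Sev-1": Policy(5, ("primary", "secondary"), 30, "required"),
    "Sev-2": Policy(15, ("primary",), 60, "required"),
    "Sev-3": Policy(None, (), None, "optional"),
    "Sev-4": Policy(None, (), None, None),
}

ORDER = ("Sev-1", "Sev-2", "Sev-3", "Sev-4")  # most to least severe


def may_change_severity(role: str, current: str, proposed: str) -> bool:
    """Only the incident commander may downgrade; upgrades are open (assumption)."""
    is_downgrade = ORDER.index(proposed) > ORDER.index(current)
    return role == "incident_commander" or not is_downgrade
```

Keeping the downgrade check in code means the "call it a Sev-2 to skip the postmortem" move fails at the tool, not in an argument.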
Edge cases & tie-breakers
- "It only affects free-tier users": doesn't downgrade severity. Customer trust is built on the free tier; treat it like paid.
- "It only affects one big customer": if the customer is a top-10 ARR account, it's a Sev-1. Concentration risk is real.
- "It's during a maintenance window": the window doesn't change severity, only whether the on-call is paged or already expecting it. Sev-1 actions still happen.
- Suspected security incident: auto-Sev-1 and page security-on-call, even if you're not sure. Wrong-direction calls cost minutes; missed ones cost everything.
- Compliance / regulatory triggers (SOC 2, HIPAA, GDPR breach): auto-Sev-1 with legal-on-call paged. There are clocks running you can't see.
- Tie-breaker for committee: whoever is paged sets the initial severity. The IC resolves disagreements after the fact, not before action.
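The automatic escalations above can be applied as overrides on top of whatever severity the tree produced. A minimal sketch with hypothetical names (`apply_overrides` and its flags); the free-tier and maintenance-window cases are deliberately absent as parameters, because neither is allowed to change severity.

```python
from typing import List, Tuple


def apply_overrides(
    base_severity: str,
    *,
    suspected_security: bool = False,
    compliance_trigger: bool = False,
    top10_arr_customer: bool = False,
) -> Tuple[str, List[str]]:
    """Return (severity, extra on-calls to page) after the auto-Sev-1 rules."""
    severity = base_severity
    extra_pages: List[str] = []
    if suspected_security:
        severity = "Sev-1"
        extra_pages.append("security-on-call")
    if compliance_trigger:
        severity = "Sev-1"
        extra_pages.append("legal-on-call")
    if top10_arr_customer:
        severity = "Sev-1"
    return severity, extra_pages
```

For example, `apply_overrides("Sev-3", suspected_security=True)` returns `("Sev-1", ["security-on-call"])`. Overrides only ever raise severity in this sketch; lowering it stays with the incident commander.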