Intermediate By Samson Tanimawo, PhD Published Aug 24, 2026 4 min read

Incident Severity Decision Tree

"Is this a Sev-1?" should never be a meeting. Three checks (impact, surface, reversibility) resolve it in under 30 seconds. Print it, post it, end the debates.

The 30-second tree

Ask three yes/no questions in this exact order. A yes on impact or a yes on irreversibility makes it a Sev-1; otherwise, surface decides between Sev-2 and Sev-3.

Q1. Is revenue, customer trust, or safety at stake right now? → Yes = Sev-1.
Q2. Is the damage permanent or growing without bound? → Yes = Sev-1.
Q3. Is > 50% of users / a critical user journey degraded? → Yes = Sev-2.
None of the above? → Sev-3. Fix in business hours.
Internal-only impact (employee tools broken, no customer effect)? → Sev-4 regardless of size, unless it's blocking incident response itself.
Default to the higher severity when unsure. Downgrading later is cheap; upgrading later loses 30 minutes of response time.
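The tree above can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the `Incident` fields and the `classify` name are hypothetical, and two choices are assumptions of this sketch rather than rules from the tree. First, the internal-only rule is applied after Q1 and Q2, so it overrides surface only. Second, an internal-only incident that blocks incident response simply falls through to Q3.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Field names are illustrative, not taken from any real tool.
    impact_at_stake: bool        # Q1: revenue, customer trust, or safety at stake right now
    permanent_or_growing: bool   # Q2: damage permanent or growing without bound
    wide_surface: bool           # Q3: > 50% of users or a critical user journey degraded
    internal_only: bool = False  # employee tools broken, no customer effect
    blocks_incident_response: bool = False


def classify(incident: Incident) -> int:
    """Return the severity (1-4) by walking the questions in order."""
    if incident.impact_at_stake:        # Q1
        return 1
    if incident.permanent_or_growing:   # Q2
        return 1
    # Assumption: internal-only overrides surface ("regardless of size"),
    # unless the breakage is blocking incident response itself.
    if incident.internal_only and not incident.blocks_incident_response:
        return 4
    if incident.wide_surface:           # Q3
        return 2
    return 3
```

When an answer is uncertain, pass `True` for that field; that mirrors the rule of defaulting to the higher severity.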

Check 1: Impact

Impact has three flavors: revenue, trust, safety. Any of them at scale is a Sev-1.

Revenue. Checkout broken, payments rejected, signups failing. The test: a hard-dollar number is ticking down every minute.
Trust. PII leak, the wrong data shown to the wrong customer, a security event. Reputation damage outlasts the outage by months.
Safety. Healthcare, automotive, anything where a customer can be physically harmed. Always Sev-1, no surface threshold.
"Slowness" alone is not Sev-1 unless it's slow enough to break things (timeouts, retries hammering downstream services). Latency that's merely annoying = Sev-2.
Customer-facing error rate > 5% sustained > 5 minutes = Sev-1 in most companies. Calibrate the number to your business.
If the CEO would call you about it, it's a Sev-1. Useful gut check.
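The error-rate rule above (more than 5% for more than 5 minutes) is easy to get subtly wrong, because a single spike is not "sustained". A rough sketch follows, assuming error-rate samples arrive as `(unix_seconds, error_rate)` pairs sorted by time. The function name, the input shape, and the defaults are illustrative; calibrate the thresholds to your business.

```python
from typing import Optional, Sequence, Tuple


def sustained_breach(
    samples: Sequence[Tuple[float, float]],
    rate_threshold: float = 0.05,   # 5% customer-facing error rate
    min_duration_s: float = 300.0,  # 5 minutes
) -> bool:
    """True if the rate stayed above the threshold for longer than min_duration_s."""
    breach_start: Optional[float] = None
    for ts, rate in samples:
        if rate > rate_threshold:
            if breach_start is None:
                breach_start = ts          # a new breach run begins here
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None            # any healthy sample resets the clock
    return False
```

Resetting on a single healthy sample is a deliberately strict reading of "sustained"; a flapping service may need a more forgiving window.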

Check 2: Surface

Once impact is established, surface decides between Sev-1 and Sev-2.

How many users? > 50% = Sev-1 territory regardless of feature. < 5% = often Sev-3 unless the feature is critical.
How many tenants? In multi-tenant systems, "10 tenants down" can be Sev-1 even if those 10 are 1% of users. If they're whales, the impact is asymmetric.
Which user journey? "Login broken" trumps "settings page broken" even at the same percentage. Critical paths get severity bumps.
Geographic surface. "All of EU" is Sev-1 even if EU is < 50% of traffic. Regional outages have regulatory implications.
The default-error mode of every feature should be documented somewhere. "Search degraded falls back to recommendations" is a Sev-3 design; "checkout has no fallback" is a Sev-1 design.
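The surface rules above can be combined into one function, for the case where impact is already established. This sketch makes several assumptions that the text leaves open: the 5% to 50% band maps to Sev-2, a critical journey bumps the result by exactly one level, and any whale tenant or whole-region outage counts as Sev-1. All names are hypothetical.

```python
def surface_severity(
    user_fraction: float,            # share of users affected, 0.0 to 1.0
    critical_journey: bool = False,  # e.g. login or checkout, not the settings page
    whale_tenants_hit: bool = False,
    whole_region_down: bool = False,
) -> int:
    """Map surface to a severity, assuming impact has already been established."""
    if not 0.0 <= user_fraction <= 1.0:
        raise ValueError("user_fraction must be between 0 and 1")
    # More than half of users, asymmetric tenant impact, or a full region: Sev-1.
    if user_fraction > 0.5 or whale_tenants_hit or whole_region_down:
        return 1
    # Assumption: under 5% starts at Sev-3, everything else at Sev-2.
    severity = 3 if user_fraction < 0.05 else 2
    if critical_journey:
        severity -= 1                # critical paths get a one-level bump
    return severity
```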

Check 3: Reversibility

The most-missed dimension. A small problem getting worse every minute, or one you can't undo, is a Sev-1 even if today's surface is small.

Data loss in progress. Corruption being written to the DB, a queue draining unprocessed messages. Sev-1 from minute one. The clock matters more than the size.
Cascading failure. One service is down, retries are ramping, downstream services are starting to time out. Sev-1, even before the second service falls.
Irreversible actions. Emails sent, refunds processed, customer messages delivered. Once it's out, you can't pull it back. Sev-1 if it's the wrong content.
Reputational decay. A customer-visible page covered in errors, the kind that ends up on social media. The longer it lasts, the deeper the trust damage. Sev-1 after 5 minutes.
Counter-test. If you go back to bed, will the situation be worse in two hours? Yes = Sev-1.
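One way to make the counter-test concrete is a straight-line projection of a damage count (corrupted rows, unprocessed messages, failed requests) two hours ahead. The sketch below assumes you can sample such a count over time. The function names and the linear model are illustrative. Cascading failures often grow faster than linearly, so treat the projection as a floor, not a forecast.

```python
from typing import Sequence, Tuple


def projected_damage(
    samples: Sequence[Tuple[float, float]],  # (unix_seconds, damage_count), sorted by time
    horizon_s: float = 7200.0,               # two hours
) -> float:
    """Extend the line through the first and last sample by horizon_s seconds."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return d1 + (d1 - d0) * horizon_s / (t1 - t0)


def worse_in_two_hours(samples: Sequence[Tuple[float, float]]) -> bool:
    """The counter-test: is the projected damage higher than it is now?"""
    return projected_damage(samples) > samples[-1][1]
```

For example, 100 corrupted rows growing to 160 over ten minutes projects to 880 rows two hours later.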

The Sev-1 vs Sev-2 table

Sev-1. Response time < 5 min. Channels: page primary + secondary. Comms: status page within 15 min. Eng leadership notified. War room. Postmortem required.
Sev-2. Response time < 15 min. Channels: page primary. Comms: internal channel; status page only if customer-visible. Postmortem required.
Sev-3. Response time: business hours. Channels: ticket / async. Comms: none required. Postmortem optional.
Sev-4. Internal-only / minor. Track in the backlog. No paging.
Cadence of updates. Sev-1 every 30 min, Sev-2 every hour. Even just "no change" is the right update.
Who can downgrade severity. Only the incident commander, no one else. This avoids "let me just call this a Sev-2 so we don't have to do a postmortem".
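Encoding the table as data keeps paging tools and runbooks from drifting apart. The sketch below is one possible shape: the `ResponsePolicy` type, its field names, and the role string `"incident_commander"` are hypothetical, and `None` marks values the table does not specify.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResponsePolicy:
    response_minutes: Optional[int]      # None = business hours or backlog
    page: Tuple[str, ...]                # who gets paged
    status_page: str
    postmortem: Optional[str]            # None = not specified in the table
    update_every_minutes: Optional[int]  # None = no update cadence given


POLICIES = {
    1: ResponsePolicy(5, ("primary", "secondary"), "within 15 min", "required", 30),
    2: ResponsePolicy(15, ("primary",), "only if customer-visible", "required", 60),
    3: ResponsePolicy(None, (), "none required", "optional", None),
    4: ResponsePolicy(None, (), "none required", None, None),
}


def can_downgrade(actor_role: str) -> bool:
    """Only the incident commander may lower a severity."""
    return actor_role == "incident_commander"
```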

Edge cases & tie-breakers

"It only affects free-tier users" doesn't downgrade severity. Customer trust is built on the free tier; treat it like paid.
"It only affects one big customer." If the customer is a top-10 ARR account, it's a Sev-1. Concentration risk is real.
"It's during a maintenance window." The window doesn't change severity, only whether the on-call was paged or already expecting it. Sev-1 actions still happen.
Suspected security incident. Auto-Sev-1 + page security-on-call, even if you're not sure. False alarms cost minutes; missed ones cost everything.
Compliance / regulatory triggers (SOC2, HIPAA, GDPR breach). Auto-Sev-1 with legal-on-call paged. There are clocks running that you can't see.
Tie-breaker for debates by committee. Whoever is paged sets the initial severity. The IC resolves disagreements after the fact, not before action.
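These tie-breakers work as overrides applied on top of whatever the tree returned. A minimal sketch, with hypothetical parameter names and on-call labels:

```python
from typing import List, Tuple


def apply_overrides(
    severity: int,
    suspected_security: bool = False,
    compliance_trigger: bool = False,
    top10_arr_customer: bool = False,
    free_tier_only: bool = False,         # accepted but deliberately ignored
    in_maintenance_window: bool = False,  # accepted but deliberately ignored
) -> Tuple[int, List[str]]:
    """Return (final severity, extra on-call rotations to page)."""
    extra_pages: List[str] = []
    if suspected_security:
        severity = 1
        extra_pages.append("security-on-call")
    if compliance_trigger:
        severity = 1
        extra_pages.append("legal-on-call")
    if top10_arr_customer:
        severity = 1
    # Free-tier scope and maintenance windows never lower the severity,
    # so neither flag is consulted here.
    return severity, extra_pages
```

The two ignored flags stay in the signature on purpose: callers can pass them, and the function body documents that they change nothing.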