Severity Levels: A Five-Tier Framework Teams Actually Use
Severity tiers only work if everyone reads them the same way. Five tiers is the smallest set that still separates the cases that need different responses.
SEV1, outage
Customer-facing outage. Many users impacted. Response: page; war room open within 5 minutes; status page updated within 15 minutes.
Examples: site down; payments broken for everyone; auth completely unreachable.
SEV2, degradation
Significant degradation. Some users impacted, or all users slowed. Response: page; investigate within 15 minutes.
Examples: 5xx rate at 2x normal; p99 latency tripled; partial outage in one region.
SEV3, single-feature
Single feature broken; workaround exists. Response: ticket; investigate within 4 hours.
Examples: one search filter broken; one report fails to render; one webhook delivery delayed.
SEV4/5, internal & informational
SEV4: internal-only impact (nightly batch failed; internal dashboard slow; an engineer’s test environment broken). Response: ticket; investigate next business day.
SEV5: informational only (capacity trend; cert expiring 30 days out). Response: ticket; investigate next sprint.
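One way to keep everyone reading the tiers the same way is to hold the definitions in a single lookup table that both people and alert routing consult. The sketch below is a minimal Python illustration of that idea, not a prescribed implementation. The names Severity, ResponsePolicy, POLICIES, and should_page are hypothetical, and the deadlines are copied from the tier descriptions above as plain text rather than enforced.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching the SEV1..SEV5 naming."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass(frozen=True)
class ResponsePolicy:
    summary: str   # one-line definition of the tier
    page: bool     # True = page on-call, False = file a ticket
    deadline: str  # response expectation, kept as text from the tier list


# Hypothetical table; the values mirror the tier descriptions in this article.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        "customer-facing outage", True,
        "war room within 5 minutes; status page within 15 minutes"),
    Severity.SEV2: ResponsePolicy(
        "significant degradation", True,
        "investigate within 15 minutes"),
    Severity.SEV3: ResponsePolicy(
        "single feature broken; workaround exists", False,
        "investigate within 4 hours"),
    Severity.SEV4: ResponsePolicy(
        "internal-only impact", False,
        "investigate next business day"),
    Severity.SEV5: ResponsePolicy(
        "informational only", False,
        "investigate next sprint"),
}


def should_page(sev: Severity) -> bool:
    """Return True when the tier calls for a page rather than a ticket."""
    return POLICIES[sev].page


if __name__ == "__main__":
    for sev in Severity:
        policy = POLICIES[sev]
        action = "page" if policy.page else "ticket"
        print(f"{sev.name}: {policy.summary} -> {action}; {policy.deadline}")
```

Keeping the table in one place means a change to a tier's response is a single reviewed edit instead of a rule that drifts between runbooks.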
Antipatterns
- Three tiers (SEV1/2/3 only). Real distinctions get squashed.
- Eight tiers. Nobody remembers; severity calls become arguments.
- SEV1 for everything customer-impacting. Loses the distinction between ‘1 user’ and ‘all users.’
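The last antipattern is easier to avoid when the severity call is a short, fixed sequence of questions instead of a gut reaction. The following Python sketch shows one possible ordering of those questions. The function name classify, its four boolean inputs, and the order of the checks are all assumptions made for illustration; the tier definitions above do not prescribe them.

```python
def classify(*, informational: bool, customer_facing: bool,
             widespread_outage: bool,
             single_feature_with_workaround: bool) -> str:
    """Map four yes/no answers to a tier name.

    The ordering of the checks is an assumption for this sketch.
    """
    if informational:
        return "SEV5"   # nothing is broken yet (trend, upcoming expiry)
    if not customer_facing:
        return "SEV4"   # only internal systems or people are affected
    if widespread_outage:
        return "SEV1"   # customers cannot use the product at all
    if single_feature_with_workaround:
        return "SEV3"   # narrow breakage that customers can route around
    return "SEV2"       # customer-facing degradation short of an outage


if __name__ == "__main__":
    # Customer-facing, but one feature with a workaround: not a SEV1.
    print(classify(informational=False, customer_facing=True,
                   widespread_outage=False,
                   single_feature_with_workaround=True))
```

Customer-facing impact on its own never reaches SEV1 in this sketch; only a widespread outage does, which preserves the distinction the antipattern erases.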
What to do this week
Three moves. (1) Apply the five-tier definitions to your noisiest alert and decide which tier it really belongs in. (2) Measure pages-per-shift for one week before and one week after the change. (3) Schedule the quarterly review so the discipline survives team turnover.
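The pages-per-shift measurement in move (2) can be done with a few lines of Python over exported page timestamps. This is a rough sketch under stated assumptions: the function name pages_per_shift is hypothetical, shifts are assumed to be 8 hours long and to start at the beginning of the measured week, and the timestamps in the usage example are made-up sample data, not real results.

```python
from datetime import datetime, timedelta


def pages_per_shift(page_times, week_start, shift_hours=8, days=7):
    """Average number of pages per shift over a fixed window.

    page_times: iterable of datetime objects, one per page sent.
    week_start: datetime marking the start of the window.
    shift_hours: assumed shift length; adjust to your rotation.
    """
    shift = timedelta(hours=shift_hours)
    n_shifts = (days * 24) // shift_hours
    counts = [0] * n_shifts
    for t in page_times:
        idx = (t - week_start) // shift   # timedelta // timedelta yields an int
        if 0 <= idx < n_shifts:           # ignore pages outside the window
            counts[idx] += 1
    return sum(counts) / n_shifts


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    before_start = datetime(2024, 1, 1)
    before = [before_start + timedelta(hours=h) for h in (1, 2, 9, 30, 31, 50)]
    after_start = datetime(2024, 1, 8)
    after = [after_start + timedelta(hours=h) for h in (3, 40)]

    rate_before = pages_per_shift(before, before_start)
    rate_after = pages_per_shift(after, after_start)
    print(f"before: {rate_before:.2f} pages/shift")
    print(f"after:  {rate_after:.2f} pages/shift")
```

Comparing the two averages over equal-length windows keeps the before and after numbers comparable even if the number of alerts firing differs between weeks.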