Alert Management By Nova AI Ops Team Published Sep 22, 2026 11 min read

Designing Alert Severity Levels: A Framework That Survives Contact with Production

Most severity schemes look reasonable on paper and decay within six months as every alert author marks their alert "critical." Here is a framework for severity that holds up under real production pressure.

Why Severity Design Is Harder Than It Looks

Every alert needs a severity. The severity drives routing (page or Slack), escalation (5 minutes or 30), and reporting (which alerts do we treat as serious enough to count). A well-designed severity scheme makes on-call sustainable. A badly designed scheme produces either alert fatigue (everything is critical, nothing matters) or missed incidents (low-severity alerts get ignored that should have been escalated).

The challenge is that severity is set by individual alert authors, often months or years before the alert ever fires. There is no central reviewer ensuring consistency. The natural drift is toward severity inflation: every author wants their alert taken seriously, so every alert is "critical." Six months later your SEV1 catalog has 400 entries and the on-call team has stopped paying attention.

The framework below is designed to resist that drift. It uses concrete, measurable criteria rather than subjective judgment, and it includes governance mechanisms (review cycles, demotion rules) that keep the scheme honest over time.

3-Tier vs 5-Tier: Which Is Right for You

The two common designs are 3-tier (SEV1 / SEV2 / SEV3) and 5-tier (SEV1 through SEV5). The trade-off is precision versus simplicity.

3-tier works for most teams. SEV1 = page someone now. SEV2 = address within business hours. SEV3 = investigate without paging. The boundaries are crisp, every engineer can hold the definitions in their head, and the routing rules are simple. Recommended for teams under 200 engineers.

5-tier makes sense for larger orgs. The added precision helps when you have multiple service-tier classifications (Tier 1 customer-facing vs Tier 2 internal vs Tier 3 batch) and need to express different urgency for the same severity at different tiers. Major enterprises and ITSM-heavy organizations typically use 5-tier (often borrowed from ITIL).

The mistake is using 5-tier when 3-tier would work, because the extra granularity creates ambiguity at the boundaries. "Is this SEV2 or SEV3?" wastes more on-call time than "is this SEV1 or SEV2?" because the consequences of getting it wrong are smaller and the deliberation is the same.

Defining SEV1: The "Wake Someone Up" Bar

SEV1 is the only severity that pages outside business hours. The definition needs to be tight enough that you would, if you imagined yourself as the engineer being woken at 3 a.m., feel that being woken was justified.

The clearest SEV1 criteria are concrete and measurable:

Customer-facing service is unavailable (failed health check or sustained 5xx rate above 5%) for more than 2 minutes.
Data loss or corruption is occurring or has occurred (any duration).
Security incident in progress (active exploitation, credential compromise, data exfiltration).
Revenue-generating system degraded below an explicit business threshold (e.g., checkout success rate below 95% for more than 5 minutes).
Compliance breach in progress (e.g., PII exposed, GDPR-relevant data leaked).

Note what is not SEV1: high CPU usage on a server, a failed batch job, a slow query, a degraded internal tool. These are real problems but they do not justify waking someone at 3 a.m. They are SEV2 or SEV3 candidates.

The acid test: if this alert fires at 3 a.m. and the engineer asks "could this have waited until morning?" and the honest answer is yes, the alert is not SEV1.

Defining SEV2: Business-Hours Urgency

SEV2 alerts are routed to the on-call rotation but only paged during business hours. Outside business hours, they post to a Slack channel without paging anyone, and the next on-call engineer picks them up at the start of their shift.

SEV2 criteria are typically:

Customer-facing degradation below the SEV1 threshold (e.g., 1-5% error rate, latency above SLO but below the customer-pain threshold).
Internal tool unavailable (the tool is required for engineering work but no customer impact).
Capacity warning (resources at 80%+ utilization with growth trend, where the team needs to act in days but not hours).
Failed automated process that does not have an immediate customer impact (a nightly batch job, a backup that did not complete).
Drift in non-production environment (a staging system in a degraded state).

SEV2 is the largest and most useful category. Most real production issues live here. The trick is keeping it from becoming a dumping ground for every alert that someone "doesn't want to be SEV1 but feels important."

Defining SEV3: Investigate Without Paging

SEV3 alerts post to a team Slack channel and are addressed during normal work. They are not paged at all, ever. The point of SEV3 is to surface anomalies that deserve human attention but do not justify interruption.

SEV3 criteria:

Anomaly detection without confirmed impact (a metric is unusually high but no customer-facing symptom yet).
Single instance issue in a redundant system (one node out of ten unhealthy, the system is still serving).
Long-term trend signal (cost trending up, latency trending up, error budget burning at an unsustainable rate but with weeks of runway).
Non-urgent maintenance reminder (cert expiring in 30 days, dependency reaching EOL).

The defining property of SEV3 is that it is never wrong to ignore one for 24 hours. If ignoring it for a day creates real risk, it is SEV2 or SEV1.

The Customer-Impact Heuristic

When in doubt, the deciding question is always: "What is the customer-facing impact?"

Use this matrix:

Customer impact	Time to act	Severity
Complete service unavailable to customers	Now	SEV1
Significant degradation visible to customers	Within 30 minutes	SEV1 (off-hours: SEV2)
Mild degradation customers may notice	Within 4 hours	SEV2
No customer impact, but system unhealthy	Within 1 business day	SEV2 or SEV3
Anomaly without confirmed impact	Investigate	SEV3

The matrix forces explicit judgment about customer impact. A common failure mode in severity design is alerts that have no customer-impact rationale at all ("CPU is high" or "queue depth is rising") and are nevertheless tagged SEV1 by their author. The matrix makes this look obviously wrong.

Fighting Severity Inflation

Severity inflation is the slow drift where every alert becomes SEV1. It happens because adding an alert is easy and demoting one feels like accusing the original author of overreach. Three governance mechanisms keep the scheme honest:

1. Severity is reviewed at alert creation. Every new alert PR includes the severity in the diff and is reviewed by a teammate (or by an SRE lead for SEV1). The review explicitly asks: "Does this meet the SEV1 criteria? If not, demote it."

2. Pages are reviewed in retrospect. After every SEV1 page, the on-call writes a one-line note: "Was this a real SEV1?" If the answer is no for the same alert three times, the alert is automatically demoted to SEV2.

3. The SEV1 catalog is reviewed quarterly. An SRE lead audits the SEV1 catalog every quarter and aggressively demotes anything that did not actually wake someone with a customer-impacting outage in the past 90 days. The audit takes 2 hours and saves dozens of unnecessary 3 a.m. pages.

The Quarterly Severity Review

The single most useful operational practice for severity hygiene is the quarterly review. Block 2 hours, pull the list of every alert tagged SEV1 in your alerting tool, and walk through each one with this checklist:

Has this alert fired in the last 90 days? If no, demote or delete.
When it fired, did it require immediate action? If no for more than 50% of firings, demote.
Is the criteria still measurable and concrete? If the criteria has drifted to "anomaly that someone might want to look at," demote.
Is the runbook still accurate? If no, fix the runbook (urgent regardless of severity).
Does the routing still work? If the team that owns it has changed, update the route.

Teams that do this review consistently report on-call page volume dropping 40-60% over the first year, with no measurable degradation in incident response. The pages that remain are the ones that genuinely need a human at 3 a.m.

For teams that want to skip the manual review work entirely, AI-native platforms like Nova AI Ops automatically learn which alerts produce real incidents versus noise and propose severity adjustments based on the historical signal. The same AI handles the auto-resolution of the routine SEV1 alerts so the human page volume drops even further. Try Nova to evaluate.