Alert Priority vs Severity

Priority and severity are two distinct attributes of an incident, and both matter.

Priority and severity are different axes

Severity is impact: how broken is the service for users. Priority is response order: which incidents do we work on first. Conflating the two is why backlog grooming sessions devolve into arguments: a sev3 can still be P1 for a customer with a contractual deadline.

Severity is impact. Sev1 means production is broken for users; Sev3 means a feature is degraded for a small cohort.
Priority is response order. P1 is acted on first; P3 waits; priority sequences the queue.
Different axes. Sev3 can still be P1 for a customer with a contractual deadline; the two axes are independent.
Both fields per incident. Every incident records both: severity captures the technical impact; priority captures the business response order.
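As a minimal sketch of the two independent axes, an incident record can carry both fields and the work queue can sort on priority alone. The names `Severity`, `Priority`, `Incident`, and `work_queue` are hypothetical, and using severity as a tie-breaker is an assumption, not something the definitions above require.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Impact on users; a lower number means worse impact."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


class Priority(IntEnum):
    """Response order; a lower number is worked first."""
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Severity  # technical impact
    priority: Priority  # business response order


def work_queue(incidents):
    """Sequence the queue by priority; severity only breaks ties (assumption)."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


incidents = [
    Incident("Tooltip typo", Severity.SEV3, Priority.P3),
    Incident("Export slow for one contractual customer", Severity.SEV3, Priority.P1),
    Incident("Checkout returns 500s", Severity.SEV1, Priority.P1),
]
queue = work_queue(incidents)
for incident in queue:
    print(incident.priority.name, incident.severity.name, incident.title)
```

The sev3 export issue lands ahead of everything that is not P1, which is the independence of the two axes in practice.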

How to define severity

Severity definitions need to be sharp enough to trigger the right response automatically. Sev1 is revenue-impacting and customer-visible with no workaround; sev2 is degraded but not broken; sev3 is minor or cosmetic. Each tier maps to a specific response posture.

Sev1. Revenue-impacting, customer-visible, no workaround; page on-call, call the war room.
Sev2. Degraded but not broken; workaround exists; ticket, business-hours response.
Sev3. Minor or cosmetic; backlog; triage at next standup.
Per-tier response posture. The severity definition itself triggers the response, not a subjective sense of urgency.
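One way to make the definitions sharp enough to automate is to classify from a few observable facts and look the response up from the tier. This is a sketch under stated assumptions: the `Impact` fields, the order of the checks, and the function names are hypothetical, and only the tier wording and response postures come from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    revenue_impacting: bool
    customer_visible: bool
    workaround_exists: bool
    degraded: bool


# Response posture per tier, taken from the tier definitions.
RESPONSE = {
    "sev1": "page on-call, call the war room",
    "sev2": "ticket, business-hours response",
    "sev3": "backlog, triage at next standup",
}


def classify(impact: Impact) -> str:
    """Return the severity tier for an observed impact (check order is an assumption)."""
    if (impact.revenue_impacting and impact.customer_visible
            and not impact.workaround_exists):
        return "sev1"
    if impact.degraded:
        return "sev2"  # degraded but not broken
    return "sev3"      # minor or cosmetic


def respond(impact: Impact) -> str:
    """The tier, not a gut feeling, selects the response."""
    return RESPONSE[classify(impact)]


outage = Impact(revenue_impacting=True, customer_visible=True,
                workaround_exists=False, degraded=True)
print(classify(outage), "->", respond(outage))
```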

How to define priority

Priority is about scheduling: when does this work happen relative to everything else. P1 cancels other work; P2 fits in this sprint and blocks the next; P3 sits in the backlog and gets reviewed at planning. Three tiers is enough; more fragments the data.

P1. Do this right now; cancels other work.
P2. Do this in this sprint; blocks the next.
P3. Backlog; reviewed at planning.
Per-tier scheduling rule. Each priority binds the work to a sprint window, which keeps planning predictable.
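The scheduling rule can be sketched as a lookup from priority to a scheduling bucket. The bucket names, the `plan` function, and the ticket IDs are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Scheduling bucket per priority tier (bucket names are hypothetical).
SCHEDULE = {
    "P1": "now",          # cancels other work
    "P2": "this_sprint",  # done in this sprint; blocks the next
    "P3": "backlog",      # reviewed at planning
}


def plan(tickets):
    """Group (ticket_id, priority) pairs into scheduling buckets."""
    buckets = defaultdict(list)
    for ticket_id, priority in tickets:
        buckets[SCHEDULE[priority]].append(ticket_id)
    return dict(buckets)


plan_result = plan([("T-1", "P3"), ("T-2", "P1"), ("T-3", "P2"), ("T-4", "P1")])
print(plan_result)
```

With only three tiers the lookup stays a three-line table, which is part of the argument for not adding more.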

Mapping the two

Laying the two axes out as a matrix surfaces the interesting cases. Most sev1 incidents are P1 and most sev3 issues are P3, but sev2/P1 (a workaround exists but the customer pays for a fast fix) and sev1/P2 (production broken but contained, post-incident work scheduled) are the cases that need explicit policy.

Common cases. Most sev1 incidents are P1; most sev3 issues are P3; the matrix mostly aligns.
Sev2/P1. Workaround exists but the customer pays for a fast fix; the off-diagonal case worth naming.
Sev1/P2. Production broken but contained; post-incident work scheduled; the other off-diagonal case.
RACI matrix. Map both fields onto on-call runbooks and backlog grooming; both fields are required for any new ticket or alert.
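A ticket intake check along these lines can enforce that both fields are present and name the off-diagonal combinations that have a written policy. This is a sketch, not a prescribed implementation: the `review` function and the dictionary-shaped ticket are hypothetical, treating sev2/P2 as an aligned pair is an assumption, and so is routing any unlisted combination to manual justification.

```python
# Pairs where severity and priority line up (sev2/P2 is an assumption).
DIAGONAL = {("sev1", "P1"), ("sev2", "P2"), ("sev3", "P3")}

# Off-diagonal pairs with an explicit policy, worded as in the text.
OFF_DIAGONAL_POLICY = {
    ("sev2", "P1"): "workaround exists, but the customer pays for a fast fix",
    ("sev1", "P2"): "production broken but contained; post-incident work scheduled",
    ("sev3", "P1"): "customer with a contractual deadline",
}


def review(ticket: dict) -> str:
    """Reject tickets missing either field, then classify the pair."""
    missing = [f for f in ("severity", "priority") if not ticket.get(f)]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    pair = (ticket["severity"], ticket["priority"])
    if pair in DIAGONAL:
        return "aligned"
    if pair in OFF_DIAGONAL_POLICY:
        return "policy: " + OFF_DIAGONAL_POLICY[pair]
    return "needs explicit justification"


print(review({"severity": "sev1", "priority": "P1"}))
print(review({"severity": "sev2", "priority": "P1"}))
print(review({"severity": "sev3", "priority": "P2"}))
```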

Standardize before scaling

Standardization is cheap before scaling and expensive after. Pick definitions before adding more teams; keep three tiers each; audit usage quarterly because tier inflation (everyone tags everything sev1) is the failure mode to watch for.

Pick definitions early. Before adding more teams; late renames break dashboards and runbooks.
Three tiers each. More tiers fragment the data without improving response.
Quarterly usage audit. Tier inflation is the failure mode; everyone tags everything sev1 if usage is not policed.
Per-org tier policy. Commit the tier definitions and the audit cadence to the engineering handbook.
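The quarterly audit can start as a simple share calculation over the quarter's severity tags. The sketch below is illustrative only: the function names are hypothetical, and the 25% threshold for sev1 is an arbitrary placeholder that each organization would need to set from its own history.

```python
from collections import Counter

TIERS = ("sev1", "sev2", "sev3")


def severity_shares(tags):
    """Fraction of tickets tagged with each tier; all zeros for an empty quarter."""
    counts = Counter(tags)
    total = sum(counts[tier] for tier in TIERS)
    if total == 0:
        return {tier: 0.0 for tier in TIERS}
    return {tier: counts[tier] / total for tier in TIERS}


def inflation_flag(tags, max_sev1_share=0.25):
    """True when sev1 exceeds the allowed share (threshold is a placeholder)."""
    return severity_shares(tags)["sev1"] > max_sev1_share


inflated_quarter = ["sev1"] * 6 + ["sev2"] * 3 + ["sev3"] * 1
healthy_quarter = ["sev1"] * 1 + ["sev2"] * 3 + ["sev3"] * 6

print(severity_shares(inflated_quarter), inflation_flag(inflated_quarter))
print(severity_shares(healthy_quarter), inflation_flag(healthy_quarter))
```

A flagged quarter is a prompt to re-read the definitions with the teams involved, not proof that any single ticket was mis-tagged.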