Alert Priority vs Severity
Severity and priority are two separate attributes of an incident, and both need to be recorded.
Priority and severity are different axes
Severity measures impact: how broken the service is for users. Priority sets response order: which incidents get worked on first. Conflating the two is why backlog grooming sessions devolve: a sev3 can still be P1 for a customer with a contractual deadline, and a single combined field cannot express that.
- Severity is impact. Sev1 means production is broken for users; Sev3 means a feature is degraded for a small cohort.
- Priority is response order. P1 is acted on first; P3 waits; priority sequences the queue.
- Different axes. Sev3 can still be P1 for a customer with a contractual deadline; the two axes are independent.
- Both fields on every incident. Severity captures the technical impact; priority captures the business response order.
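One way to keep the two axes independent is to store them as separate fields on the incident record. The sketch below is illustrative only: the Incident class, the work_queue helper, and the sample incidents are hypothetical names, and using severity as a tie-breaker within a priority is an assumption, not something the definitions above require.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # production is broken for users
    SEV2 = 2
    SEV3 = 3  # a feature is degraded for a small cohort


class Priority(IntEnum):
    P1 = 1  # acted on first
    P2 = 2
    P3 = 3  # waits


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Severity  # technical impact
    priority: Priority  # business response order


def work_queue(incidents):
    """Sequence the queue by priority; severity only breaks ties (assumption)."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


# Hypothetical sample data: the second incident is sev3 but still P1.
incidents = [
    Incident("slow dashboard", Severity.SEV2, Priority.P3),
    Incident("export bug, contractual deadline", Severity.SEV3, Priority.P1),
    Incident("checkout down", Severity.SEV1, Priority.P1),
]
queue = work_queue(incidents)
```

With this shape the sev3 export bug is worked on ahead of the sev2 dashboard issue, which is the behavior a single combined field cannot represent.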
How to define severity
Severity definitions need to be sharp enough to trigger the right response automatically. Sev1 is revenue-impacting and customer-visible with no workaround; sev2 is degraded but not broken; sev3 is minor or cosmetic. Each tier maps to a specific response posture.
- Sev1. Revenue-impacting, customer-visible, no workaround; page on-call, call the war room.
- Sev2. Degraded but not broken; workaround exists; ticket, business-hours response.
- Sev3. Minor or cosmetic; backlog; triage at next standup.
- Per-tier response posture. The written severity definition triggers the response, not anyone's sense of urgency.
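The tier definitions above can be written as a small rule so that the response follows from the definition rather than from judgment in the moment. This is a minimal sketch: the Impact fields and the classify_severity function are hypothetical, and how it treats cases the definitions do not cover (for example, revenue-impacting but with a workaround) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    revenue_impacting: bool
    customer_visible: bool
    workaround_exists: bool
    degraded: bool


# Response posture per tier, taken from the definitions above.
RESPONSE_POSTURE = {
    "sev1": "page on-call, call the war room",
    "sev2": "ticket, business-hours response",
    "sev3": "backlog, triage at next standup",
}


def classify_severity(impact: Impact) -> str:
    """Return the severity tier for an observed impact."""
    if (
        impact.revenue_impacting
        and impact.customer_visible
        and not impact.workaround_exists
    ):
        return "sev1"
    if impact.degraded:
        # Degraded but not broken; assumed to fall here whenever sev1 does not apply.
        return "sev2"
    return "sev3"  # minor or cosmetic


outage = Impact(revenue_impacting=True, customer_visible=True,
                workaround_exists=False, degraded=True)
posture = RESPONSE_POSTURE[classify_severity(outage)]
```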
How to define priority
Priority is about scheduling: when does this work happen relative to everything else. P1 cancels other work; P2 fits in this sprint and blocks the next; P3 sits in the backlog and gets reviewed at planning. Three tiers are enough; more tiers fragment the data.
- P1. Do this right now; cancels other work.
- P2. Do this in this sprint; blocks the next.
- P3. Backlog; reviewed at planning.
- Per-tier scheduling rule. Each priority binds the work to a sprint window, which keeps planning predictable.
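The scheduling rule per tier can be sketched as a lookup from priority to a planning window. The window names ("now", "this_sprint", "backlog") and the helper functions are hypothetical labels chosen for illustration.

```python
# Priority -> scheduling window, following the three tiers above.
SCHEDULING_WINDOW = {
    "P1": "now",          # do this right now; cancels other work
    "P2": "this_sprint",  # do this in this sprint; blocks the next
    "P3": "backlog",      # reviewed at planning
}


def preempts_current_work(priority: str) -> bool:
    """Only P1 cancels work that is already in progress."""
    return priority == "P1"


def plan(items):
    """Group (title, priority) pairs by scheduling window."""
    windows = {"now": [], "this_sprint": [], "backlog": []}
    for title, priority in items:
        windows[SCHEDULING_WINDOW[priority]].append(title)
    return windows


# Hypothetical work items.
windows = plan([
    ("fix login outage", "P1"),
    ("rotate expiring certificate", "P2"),
    ("tidy alert wording", "P3"),
    ("patch billing export", "P2"),
])
```

Keeping the mapping in one table means a change to the policy is a one-line edit rather than a hunt through planning tooling.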
Mapping the two
The matrix surfaces the interesting cases. Most sev1 incidents are P1 and most sev3 issues are P3, but sev2/P1 (a workaround exists but the customer pays for a fast fix) and sev1/P2 (production broken but contained, post-incident work scheduled) are the cases that need explicit policy.
- Common cases. Most sev1 incidents are P1; most sev3 issues are P3; the matrix mostly aligns.
- Sev2/P1. A workaround exists but the customer pays for a fast fix; the off-diagonal case worth naming.
- Sev1/P2. Production broken but contained; post-incident work scheduled; the other off-diagonal case.
- Severity/priority matrix. Map both fields onto on-call runbooks and backlog grooming; require both on any new ticket or alert.
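A check like the following could run when a ticket or alert is created, so that aligned pairs pass silently and off-diagonal pairs are matched against written policy. It is a sketch under assumptions: the function name is hypothetical, and the sev2 to P2 default is assumed, since the text above only states the sev1 and sev3 defaults.

```python
# Default (diagonal) pairing. The sev2 -> P2 entry is an assumption.
DEFAULT_PRIORITY = {"sev1": "P1", "sev2": "P2", "sev3": "P3"}

# Off-diagonal pairs that have explicit policy, as named above.
OFF_DIAGONAL_POLICY = {
    ("sev2", "P1"): "workaround exists but the customer pays for a fast fix",
    ("sev1", "P2"): "production broken but contained; post-incident work scheduled",
}


def review_pair(severity, priority):
    """Classify a severity/priority pair against the matrix."""
    if not severity or not priority:
        # Both fields are required on any new ticket or alert.
        raise ValueError("both severity and priority are required")
    if DEFAULT_PRIORITY[severity] == priority:
        return "aligned"
    if (severity, priority) in OFF_DIAGONAL_POLICY:
        return "off-diagonal: covered by policy"
    return "off-diagonal: needs explicit policy"


results = [
    review_pair("sev1", "P1"),
    review_pair("sev2", "P1"),
    review_pair("sev3", "P1"),
]
```

The third pair, a sev3 raised to P1, is not one of the two named cases, so the sketch flags it for a policy decision instead of guessing.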
Standardize before scaling
Standardization is cheap before scaling and expensive after. Pick definitions before adding more teams; keep three tiers each; audit usage quarterly, because tier inflation (everyone tags everything sev1) is the failure mode to watch for.
- Pick definitions early. Before adding more teams; late renames break dashboards and runbooks.
- Three tiers each. More tiers fragment the data without improving response.
- Quarterly usage audit. Tier inflation is the failure mode; everyone tags everything sev1 if not policed.
- Per-org tier policy. Commit the tier definitions and the audit cadence to the engineering handbook.
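The quarterly audit can start as a simple count of how tags are distributed across tiers. The sketch below assumes the audit input is a flat list of severity tags exported from the ticketing system; the function names and the 50% threshold are hypothetical and should be replaced with whatever the handbook commits to.

```python
from collections import Counter

TIERS = ("sev1", "sev2", "sev3")


def severity_shares(tags):
    """Return the fraction of tags in each tier (0.0 for every tier if empty)."""
    counts = Counter(tags)
    total = sum(counts[tier] for tier in TIERS)
    if total == 0:
        return {tier: 0.0 for tier in TIERS}
    return {tier: counts[tier] / total for tier in TIERS}


def sev1_inflated(tags, threshold=0.5):
    """Flag tier inflation when the sev1 share exceeds the (assumed) threshold."""
    return severity_shares(tags)["sev1"] > threshold


# Hypothetical quarter: 6 sev1, 3 sev2, 1 sev3.
quarter_tags = ["sev1"] * 6 + ["sev2"] * 3 + ["sev3"] * 1
shares = severity_shares(quarter_tags)
flagged = sev1_inflated(quarter_tags)
```

A flagged quarter is a prompt to re-read the tagged incidents against the written definitions, not proof that any single tag was wrong.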