Alert Action Distinction
This pattern separates alerts that demand action from alerts that only notify.
Two classes of alert
Action and notification are different surfaces. Action alerts demand a human response within minutes: they page someone, wake them up, and expect a runbook to be followed. Notification alerts inform but require no action right now; they land in Slack, email, or a ticket queue for the next business day. Mixing the two is the root cause of alert fatigue, because a pager that fires for FYI events trains responders to ignore it.
- Action alerts. Human response within minutes; page, wake, runbook.
- Notification alerts. Inform, no immediate action; Slack, email, ticket queue.
- Mixing causes fatigue. Pager for FYI trains responders to ignore.
- Per-class destination. Different surfaces; never share channels.
How to classify each rule
Three rules guide classification. First, ask the runbook question: does the responder do something specific in the next 15 minutes? If not, it’s a notification, not a page. Second, use severity tiers explicitly: Sev1 pages on-call, Sev2 opens a ticket, and Sev3 emails the owning team. Map each alert to exactly one tier at creation. Third, reject rules without a runbook: no runbook means no action, which means no page.
- 15-minute action test. If no specific action, it’s notification.
- Three explicit tiers. Sev1 page, Sev2 ticket, Sev3 email; map at creation.
- No runbook = no page. The ironclad rule.
- Per-rule classification gate. CI rejects rules missing tier or runbook.
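The classification gate can be sketched as a small check that CI runs over every rule. This is a minimal illustration, not a definitive implementation: the `AlertRule` shape, the `severity` label name, and the `runbook_url` annotation name are assumptions, and the sketch reads "no runbook = no page" as requiring a runbook on sev1 rules only. Tighten the check if your gate requires a runbook on every tier.

```python
from dataclasses import dataclass, field

# Assumed tier names, matching the three tiers described above.
VALID_TIERS = {"sev1", "sev2", "sev3"}


@dataclass
class AlertRule:
    """Hypothetical in-memory form of an alert rule loaded by CI."""
    name: str
    labels: dict = field(default_factory=dict)
    annotations: dict = field(default_factory=dict)


def validate_rule(rule: AlertRule) -> list:
    """Return a list of human-readable errors; an empty list means the rule passes."""
    errors = []
    tier = rule.labels.get("severity")
    if tier not in VALID_TIERS:
        errors.append(
            f"{rule.name}: severity must be one of {sorted(VALID_TIERS)}, got {tier!r}"
        )
    # No runbook means no page: a paging rule must point at a runbook.
    if tier == "sev1" and not rule.annotations.get("runbook_url"):
        errors.append(f"{rule.name}: sev1 rules require a runbook_url annotation")
    return errors


def validate_all(rules) -> list:
    """Collect errors across all rules so CI can print them and fail once."""
    return [err for rule in rules for err in validate_rule(rule)]
```

A CI job would load the rule files into `AlertRule` objects, call `validate_all`, and exit non-zero if the returned list is not empty.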
Routing the two cleanly
Routing keeps the channels clean. Split Alertmanager receivers by severity label: PagerDuty for sev1, a Slack webhook for sev2, and email for sev3, with no rule sending to more than one tier. Disable mobile push for the notification channel, because the phone is reserved for action. Make ticket creation idempotent by using the alert fingerprint as the ticket key, which avoids duplicates during flapping.
- Receiver split by severity. PagerDuty sev1, Slack sev2, email sev3.
- Mobile push off for notifications. Phone reserved for action.
- Idempotent ticket creation. Alert fingerprint as ticket key; no flap-induced duplicates.
- Per-tier dedicated channel. No cross-tier routing; supports clear discipline.
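The routing rules above can be modeled in a few lines. This sketch is illustrative only: in practice the split lives in Alertmanager routing configuration, and the receiver names, the `TicketQueue` class, and the `TICKET-n` ID format here are hypothetical. It assumes each alert carries a stable fingerprint string.

```python
# One receiver per tier; a dict lookup cannot return more than one destination.
RECEIVER_BY_TIER = {
    "sev1": "pagerduty",
    "sev2": "slack",
    "sev3": "email",
}


def route(labels: dict) -> str:
    """Return the single receiver for an alert, based on its severity label."""
    tier = labels.get("severity")
    if tier not in RECEIVER_BY_TIER:
        raise ValueError(f"unclassified alert, severity={tier!r}")
    return RECEIVER_BY_TIER[tier]


class TicketQueue:
    """Hypothetical ticket store keyed by alert fingerprint."""

    def __init__(self):
        self._tickets = {}  # fingerprint -> ticket id

    def upsert(self, fingerprint: str, summary: str):
        """Create a ticket once per fingerprint.

        Returns (ticket_id, created). A flapping alert re-fires with the same
        fingerprint, so repeat calls return the existing ticket.
        """
        if fingerprint in self._tickets:
            return self._tickets[fingerprint], False
        ticket_id = f"TICKET-{len(self._tickets) + 1}"
        self._tickets[fingerprint] = ticket_id
        return ticket_id, True
```

Calling `upsert` twice with the same fingerprint returns the same ticket ID, with `created` set to `False` on the second call, so a flapping alert never opens a second ticket.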
Review cadence
Three review patterns keep classification accurate. First, scan the action tier quarterly and demote any alert that produced no remediation in 90 days; promotion in the other direction is rare and needs a post-incident finding. Second, track the demotion rate as a noise indicator: a team demoting 20% of its action alerts per quarter is signaling that initial classification is wrong. Third, audit Slack-only alerts too, because a notification everyone ignores is clutter and should be deleted.
- Quarterly action-tier scan. No remediation in 90 days: demote.
- Demotion rate as noise indicator. 20% per quarter: classification is wrong.
- Promotion needs evidence. Post-incident finding only; not the default.
- Slack-only audit. Ignored notifications are clutter; delete.
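The quarterly scan and the demotion-rate indicator reduce to simple arithmetic. The sketch below assumes you can export, for each action alert, the date it last led to a remediation; the `ActionAlert` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionAlert:
    """Hypothetical review record for one action-tier alert."""
    name: str
    last_remediation: Optional[date]  # None if the alert never led to a remediation


def demotion_candidates(alerts, today: date, window_days: int = 90) -> list:
    """Names of action alerts with no remediation inside the review window."""
    cutoff = today - timedelta(days=window_days)
    return [
        a.name
        for a in alerts
        if a.last_remediation is None or a.last_remediation < cutoff
    ]


def demotion_rate(demoted: int, total_action_alerts: int) -> float:
    """Share of the action tier demoted this quarter, used as a noise indicator."""
    if total_action_alerts == 0:
        return 0.0
    return demoted / total_action_alerts
```

For example, demoting 2 of 10 action alerts in a quarter gives a rate of 0.2, the 20% level at which the initial classification should be questioned.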
Default to notification
The default is conservative. When in doubt, classify a new alert as a notification; promotion to action requires evidence that the lower tier missed an incident. This deliberately inverts the common reflex of paging on everything. Skip the classification scheme altogether if your team has fewer than 30 alerts in total, because the overhead of classifying exceeds the noise it removes.
- Default: notification. When in doubt, classify as notification; promotion needs evidence.
- Inverts paging reflex. The point: stop reflexive paging.
- Skip below 30 alerts. Overhead exceeds the noise; not worth it.
- Per-org classification policy. Documented; supports consistent application.