Actionable vs Informational Alerts

If a human can’t act on it in five minutes, it shouldn’t page. The two-question test that separates the two, the inbox tier for the rest, and the weekly drain.

Why the line matters

The cost of a page isn’t the alert; it’s waking a human up. Sleep interruption is the most expensive thing in your reliability budget, so be ruthless about what you spend it on. The line between actionable and informational is the line between “wake the human” and “let them read it Tuesday morning.”

The pathology of mixing them. When informational alerts share a channel with actionable ones, the on-call gets numb. The brain learns to dismiss alerts in bulk; the actionable ones get dismissed alongside the noise; the next real incident is the one that gets missed. Mixing the two tiers degrades both.

The inverse pathology. Some teams swing the other way and demote real alerts to email because they’re scared of paging. The result: a real customer-facing incident that no one sees for three hours because everyone’s in a meeting. Both extremes cost more than the discipline.

The two-question test

Before any alert routes to a pager, it has to pass two questions:

Question 1: Can a human act on this in 5 minutes? “Act” means: change a config, restart a service, page a different team, roll back a deploy. “In 5 minutes” means without going to a meeting or finding a runbook from scratch. If the answer is no, it’s not actionable.

Question 2: Does waiting cost the customer? Compare “waiting until tomorrow” with “acting now”: if the customer impact is the same in both cases, the alert is informational. The whole point of waking someone up is that the wait is expensive; if it isn’t, don’t wake them.

Both yeses: page. One yes: send to inbox. Both nos: delete the alert. The third bucket is the one most teams skip and where most of the cleanup wins live.
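
The routing rule is small enough to write down as code. A minimal sketch in Python; the names `Tier` and `route_alert` are illustrative and not part of any alerting tool, and the two booleans stand in for answers a human gives when reviewing the alert:

```python
from enum import Enum


class Tier(Enum):
    PAGE = "page"
    INBOX = "inbox"
    DELETE = "delete"


def route_alert(actionable_in_5_min: bool, waiting_costs_customer: bool) -> Tier:
    """Apply the two-question test to a single alert."""
    yes_count = int(actionable_in_5_min) + int(waiting_costs_customer)
    if yes_count == 2:
        return Tier.PAGE    # both yeses: wake the human
    if yes_count == 1:
        return Tier.INBOX   # one yes: real signal, not urgent
    return Tier.DELETE      # both nos: the alert should not exist
```

For example, `route_alert(True, False)` returns `Tier.INBOX`. Writing the rule as a function makes the third bucket hard to forget: every alert has to land in one of the three tiers.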

The inbox tier

Informational alerts are real signals; they’re just not 3am-worthy. They belong in a tier that doesn’t buzz the phone but still gets read.

The shape of the inbox tier. A Slack channel (or email folder) named like #sre-inbox, with a 24-hour SLA on triage. The on-call doesn’t have to act in real time; the team drains the queue at the start of each business day. The alert is captured, persisted, and visible, just not waking anyone.
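
One way to keep the 24-hour triage SLA honest is to flag inbox items that have waited past it. A rough sketch, assuming each inbox item records when it arrived and whether someone has triaged it; `InboxItem` and `overdue_items` are hypothetical names, and how the items get exported from Slack or email is left out:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TRIAGE_SLA = timedelta(hours=24)  # the 24-hour triage SLA described above


@dataclass
class InboxItem:
    name: str
    received_at: datetime
    triaged: bool = False


def overdue_items(items, now):
    """Return untriaged items that have been waiting longer than the SLA."""
    return [
        item for item in items
        if not item.triaged and now - item.received_at > TRIAGE_SLA
    ]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test and leaves the caller free to choose naive or timezone-aware timestamps, as long as they are used consistently.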

What goes there. Capacity warnings (disk at 70%, queue at 60% of redline), data-quality alerts (stale dashboard, late job), trend warnings (error rate up 20% on a low-volume service), one-off anomalies (a 99.9p that hit 99.95p). All real, none urgent.

The discipline. The inbox is only useful if it’s drained. Daily triage keeps the queue read; a weekly drain meeting closes it out, assigning each alert to “resolved (no action needed)”, “acted on (already done)”, or “escalated (now actionable).” The drain takes 30 minutes; the team learns more from it than any other meeting.

The weekly drain

The drain is a 30-minute meeting where the on-call rotation reviews the inbox and decides on each alert. Three outcomes per alert.

Resolved (no action needed). The alert was a false positive, or the underlying condition self-resolved. Mark and close. Note: if a particular alert is repeatedly resolved-no-action, it’s a candidate for deletion.

Acted on (already done). Someone saw it during the week, did the thing, didn’t need a page. Capture what they did in the runbook. Close.

Escalated (now actionable). The condition has gotten worse since the original alert; it now meets the page bar. Promote to a pager rule (or page right now if it’s urgent). The inbox caught it before it became an incident.

The metric to watch. Inbox-to-page promotion rate. If 50% of inbox items get promoted, the page bar is too high: some of these should have been actionable from the start. If 0% get promoted, the bar is too low: either borderline alerts are already paging, or the inbox is collecting noise that should be deleted. The healthy range is 5-15%.
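
The promotion rate is simple arithmetic over the drain outcomes. A minimal sketch, assuming each drained alert is recorded as one of three strings; the outcome labels and the function names `promotion_rate` and `assess` are illustrative, and the 5-15% band is the one given above:

```python
from collections import Counter

OUTCOMES = ("resolved", "acted", "escalated")  # the three drain outcomes
HEALTHY_RANGE = (0.05, 0.15)                   # the 5-15% band from the text


def promotion_rate(outcomes):
    """Fraction of drained inbox items that were escalated to a page."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown drain outcomes: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return None  # nothing drained yet, so no rate to report
    return counts["escalated"] / total


def assess(rate):
    """Compare a promotion rate against the healthy band."""
    low, high = HEALTHY_RANGE
    if rate is None:
        return "no data"
    if rate < low:
        return "below range"
    if rate > high:
        return "above range"
    return "healthy"
```

A drain with 12 resolved, 6 acted, and 2 escalated items gives 2 / 20 = 10%, which `assess` reports as healthy. With small weekly samples the rate is noisy, so it is probably more useful aggregated over several drains than read week by week.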

Worked examples

Disk at 70%. Can a human act in 5 minutes? Maybe (run cleanup, expand volume). Does waiting cost the customer? Not yet. Inbox.

Disk at 95%. Can a human act in 5 minutes? Yes. Does waiting cost the customer? Yes, in 30 minutes the disk fills and writes fail. Page.

Stale dashboard. Can a human act in 5 minutes? Yes (kick the dashboard job). Does waiting cost the customer? No, the customer doesn’t see the dashboard. Inbox.

Customer cart errors at 5%. Can a human act in 5 minutes? Yes (rollback, restart). Does waiting cost the customer? Yes, cart conversion drops every minute. Page.

SSL cert expires in 14 days. Can a human act in 5 minutes? Yes (renew). Does waiting cost the customer? Not for 13 days. Inbox, with a re-route to page when it hits 24 hours.

Background job failed once. Can a human act in 5 minutes? Maybe (re-run). Does waiting cost the customer? Not yet; one failure is normal. Inbox; promote when the failure rate exceeds a threshold.
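
The disk and certificate examples share a pattern: the same signal lands in a different tier depending on how close it is to hurting the customer. A sketch of that pattern, using only the thresholds named in the examples (70% and 95% for disk, 14 days and 24 hours for the certificate). Treating everything between the two stated points as inbox is an assumption, and `disk_tier` and `cert_tier` are hypothetical names:

```python
def disk_tier(used_pct: float) -> str:
    """Tier for a disk-usage alert, based on percent of the volume used."""
    if used_pct >= 95:
        return "page"    # close to full: waiting now costs the customer
    if used_pct >= 70:
        return "inbox"   # capacity warning: real, not urgent
    return "none"        # below the warning threshold: no alert


def cert_tier(hours_until_expiry: float) -> str:
    """Tier for a certificate-expiry alert, based on hours remaining."""
    if hours_until_expiry <= 24:
        return "page"    # the re-route to page at 24 hours
    if hours_until_expiry <= 14 * 24:
        return "inbox"   # inside the 14-day warning window
    return "none"
```

Here `disk_tier(70)` returns `"inbox"` and `cert_tier(24)` returns `"page"`. In practice these thresholds would live in the alerting rules themselves; the point of the sketch is that promotion to a page can be decided in advance instead of waiting for someone to notice during a drain.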

What to do this week

Three moves. (1) Look at last week’s pages and ask the two questions of each. The ones that fail one question get demoted to inbox this week, the ones that fail both get deleted, and the on-call sleeps better immediately. (2) Stand up the inbox channel and the weekly drain. The first drain works through the backlog; subsequent ones fit in the 30-minute slot. (3) Track inbox-to-page promotion rate as a quarterly KPI. The trend tells you whether your alerting is calibrated to the right bar.