Your 3 a.m. Alerts Are Telling You Something (It's Usually Not About Production)
When the on-call Slack is full of pages that resolve themselves in 90 seconds, the problem isn't the infrastructure. It's the alerting philosophy.
The real diagnosis
Teams drowning in alert noise usually look at the alerts and ask “how do we make these quieter?” The better question is “what would my on-call do right now if this page fired?” If the answer is “wait and see,” the page should not exist.
Three buckets every alert falls into
- Informational: a thing happened. Nobody needs to act right now.
- Investigative: a thing happened that might need action if the pattern continues.
- Actionable: a thing is happening that the on-call should intervene on immediately.
Only the third should page at 3 a.m. The first two belong in a dashboard or a Slack channel, not in your on-call engineer's phone.
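As a minimal sketch of that rule, the three buckets can be written down as an enum with one destination each. The destination names (`dashboard`, `slack-channel`, `pager`) and the helper functions are hypothetical placeholders, not the API of any particular alerting tool.

```python
from enum import Enum


class Bucket(Enum):
    INFORMATIONAL = 1  # a thing happened; nobody needs to act right now
    INVESTIGATIVE = 2  # might need action if the pattern continues
    ACTIONABLE = 3     # the on-call should intervene immediately


# Hypothetical destination names; substitute whatever your tooling uses.
ROUTES = {
    Bucket.INFORMATIONAL: "dashboard",
    Bucket.INVESTIGATIVE: "slack-channel",
    Bucket.ACTIONABLE: "pager",
}


def route(bucket: Bucket) -> str:
    """Return the destination for an alert in the given bucket."""
    return ROUTES[bucket]


def pages_at_3am(bucket: Bucket) -> bool:
    """Only alerts routed to the pager are allowed to wake someone up."""
    return route(bucket) == "pager"
```

The point of making the mapping explicit is that an alert with no bucket has no destination, so nothing reaches the pager by default.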
Cut the first bucket ruthlessly
Audit every alert rule. For each, ask: “when this fires, does the on-call take immediate action?” If the honest answer is “no, they check it and it usually resolves,” delete the alert or send it to a low-urgency channel.
In practice, about 40% of alerts at most teams fall into bucket 1. Deleting them sounds scary, but once they are gone the team sleeps better.
Rewrite the second
Investigative alerts should not page during sleep hours. They belong in a daily review: a channel the on-call scans during business hours, or a digest email.
If you can't figure out whether an alert is bucket 2 or bucket 3, run it for a month in bucket 2. If the on-call ever wishes they'd been paged for it overnight, promote it. Otherwise keep it daytime.
The third is sacred
Alerts in bucket 3 should be tuned aggressively. False positives in bucket 3 destroy on-call morale more than any other factor in this post.
The target: every page during sleep hours results in action within 5 minutes. If the share of pages that meet that bar drops below 80%, bucket-2 alerts have drifted into bucket 3 again. Do the audit. Most teams end up with 5 to 10 true bucket-3 alerts; more than that is not sustainable.
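One way to track that target is to compute the share of overnight pages that led to action within 5 minutes and compare it to the 80% threshold. The sketch below assumes you can export, for each page, when it fired and when someone first acted on it; the `Page` record, its field names, and the sample data are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

ACTION_WINDOW = timedelta(minutes=5)  # "action within 5 minutes"
TARGET_RATIO = 0.80                   # below this, do the audit


@dataclass
class Page:
    fired_at: datetime
    action_at: Optional[datetime]  # None if nobody took action


def acted_in_time(page: Page) -> bool:
    """True if someone acted on the page inside the action window."""
    if page.action_at is None:
        return False
    return page.action_at - page.fired_at <= ACTION_WINDOW


def action_ratio(pages: List[Page]) -> float:
    """Share of pages that led to action in time (1.0 if there were none)."""
    if not pages:
        return 1.0
    return sum(acted_in_time(p) for p in pages) / len(pages)


def needs_audit(pages: List[Page]) -> bool:
    return action_ratio(pages) < TARGET_RATIO


# Made-up sample: four overnight pages, two acted on within the window.
t0 = datetime(2024, 1, 1, 3, 0)
sample = [
    Page(t0, t0 + timedelta(minutes=2)),
    Page(t0, t0 + timedelta(minutes=4)),
    Page(t0, None),
    Page(t0, t0 + timedelta(minutes=12)),
]
print(action_ratio(sample))  # 0.5
print(needs_audit(sample))   # True
```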
The audit, step by step
Export every alert rule into a spreadsheet. One row per rule, one column for bucket classification.
Score each rule. For each, the question is: “what did the on-call do the last three times this fired?” If the honest answer is “checked it and closed the ticket,” it is not actionable.
Delete or demote. Route bucket 1 to a digest, bucket 2 to a daytime Slack channel, bucket 3 to the pager. Expect to cut your pager volume by more than half in the first round.
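The three steps can be sketched as a small script. The outcome labels, the rule names, and the scoring thresholds are all assumptions made for illustration: the only rule taken from the steps above is that “checked it and closed the ticket” is not actionable. How you separate bucket 1 from bucket 2 is a judgment call for your team.

```python
from collections import Counter

# Routing from the last step of the audit.
DESTINATION = {1: "digest", 2: "daytime-slack", 3: "pager"}


def score(last_outcomes):
    """Assign a bucket from what the on-call did the last few times.

    Outcome labels are hypothetical:
      "intervened"  - took immediate action
      "followed_up" - no immediate action, but looked into it later
      "closed"      - checked it and closed the ticket
    """
    if not last_outcomes:
        raise ValueError("need at least one firing to score a rule")
    # Assumption: a rule is bucket 3 only if every firing needed action.
    if all(o == "intervened" for o in last_outcomes):
        return 3
    # Assumption: any firing that needed some attention makes it bucket 2.
    if any(o in ("intervened", "followed_up") for o in last_outcomes):
        return 2
    return 1


# One entry per rule, as in the spreadsheet; names and history are made up.
rules = {
    "db-primary-down": ["intervened", "intervened", "intervened"],
    "disk-80-percent": ["followed_up", "closed", "closed"],
    "cpu-spike-90s": ["closed", "closed", "closed"],
    "deploy-finished": ["closed", "closed", "closed"],
}

plan = {name: DESTINATION[score(outcomes)] for name, outcomes in rules.items()}
for name, destination in plan.items():
    print(f"{name}: {destination}")

# Count of rules per destination after the first round.
print(Counter(plan.values()))
```

In this made-up sample only one of four rules stays on the pager, which is the kind of cut the first round tends to produce.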