The Alert-Storm Response Playbook
2000 alerts in 5 minutes. Your phone is a buzzing brick, your screen is unreadable, your channel is screaming. Here is the only sequence that actually works.
Why storms happen
Storms have one of three shapes. (1) A shared dependency goes down (a database, a region, a TLS cert) and every service that touches it pages independently. (2) An alert rule has a glob match that's too wide, so one symptom expands into hundreds of pages. (3) The monitoring system itself flaps: the alerter loses its connection to Prometheus and replays the backlog as fresh alerts.
The common cause underneath. Alerts are written per service, but failures are shared-fate. When the database is down, every service that uses it has a real problem; the alerts aren't wrong, they're just redundant. Storm response is about cutting through the redundancy fast enough to find the root.
The size threshold that matters. Below 20 alerts in 5 minutes, a human can read them all. Above 200 in the same window, no one can. The middle band is where things break: enough to overwhelm, not enough to obviously be a storm. Set up storm detection at the 50-alerts-per-minute mark; that's the threshold where this playbook should kick in.
First three actions
The temptation when 2000 alerts hit is to start reading them. Don’t. Three actions, in order, before you read anything.
Action 1: Mass-acknowledge. Open the pager dashboard, multi-select all firing alerts, and ack them as "storm in progress." Stop the buzzing first; you can't think while your phone vibrates every two seconds. The ack doesn't mean the alerts are resolved; it means you've seen the storm.
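If your pager is PagerDuty, the mass-ack can be scripted instead of clicked. A minimal sketch against the v2 REST API, assuming an API key with write access and a hypothetical on-call address; the dashboard multi-select does exactly the same thing.

    import requests

    PD_API = "https://api.pagerduty.com"
    HEADERS = {
        "Authorization": "Token token=REPLACE_ME",              # assumption: REST API key with write access
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": "oncall@example.com",                           # acting user's email, required on writes
    }

    def mass_ack(batch_size=100):
        """Acknowledge every triggered incident in one pass: 'seen the storm', not resolved."""
        triggered = requests.get(
            f"{PD_API}/incidents",
            headers=HEADERS,
            params={"statuses[]": "triggered", "limit": batch_size},
        ).json()["incidents"]
        if not triggered:
            return 0
        refs = [
            {"id": inc["id"], "type": "incident_reference", "status": "acknowledged"}
            for inc in triggered
        ]
        resp = requests.put(f"{PD_API}/incidents", headers=HEADERS, json={"incidents": refs})
        resp.raise_for_status()
        return len(triggered)

    # Call in a loop until it returns 0; a 2000-alert storm is several batches of 100.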
Action 2: Find the cluster. Group by service, by host, by tag. The 2000 alerts almost always reduce to 3-7 clusters. Group by tag in your alert system, sort by count, and look at the top three groups. Those cluster heads are the candidates for the root cause.
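If you want the group-by at a terminal instead of in the UI, Alertmanager will hand you every firing alert with its labels. A minimal sketch, assuming Alertmanager on localhost:9093 and a service label on your alerts; swap in whichever label your alerts actually share.

    from collections import Counter
    import requests

    ALERTMANAGER = "http://localhost:9093"   # assumption: local Alertmanager

    def top_clusters(label="service", top_n=3):
        """Count firing alerts per label value; the biggest groups are the cluster heads."""
        alerts = requests.get(
            f"{ALERTMANAGER}/api/v2/alerts", params={"active": "true"}
        ).json()
        counts = Counter(a["labels"].get(label, "<missing>") for a in alerts)
        return counts.most_common(top_n)

    if __name__ == "__main__":
        for value, n in top_clusters():
            print(f"{n:5d}  {value}")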
Action 3: Look at the time series, not the alerts. The clusters narrow the search; the time series point at the root. Pull the four golden signals (latency, error rate, saturation, traffic) for the top three clusters and look for the one that started moving first. The earliest mover is almost always the root.
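Finding the earliest mover is also scriptable. A minimal sketch against the Prometheus query API, assuming Prometheus on localhost:9090 and a hypothetical per-service error-rate expression over http_requests_total; substitute whatever golden-signal queries your services actually expose, and treat the threshold as a placeholder.

    import time
    import requests

    PROMETHEUS = "http://localhost:9090"   # assumption: local Prometheus

    # Hypothetical per-service error rate; replace with your own golden-signal expression.
    ERROR_RATE = 'sum by (service) (rate(http_requests_total{code=~"5.."}[1m]))'

    def first_movers(threshold=1.0, lookback_s=1800, step="30s"):
        """For each service, find when its error rate first crossed the threshold; earliest first."""
        now = time.time()
        resp = requests.get(f"{PROMETHEUS}/api/v1/query_range", params={
            "query": ERROR_RATE,
            "start": now - lookback_s,
            "end": now,
            "step": step,
        }).json()
        crossings = []
        for series in resp["data"]["result"]:
            service = series["metric"].get("service", "<unknown>")
            for ts, value in series["values"]:
                if float(value) > threshold:
                    crossings.append((ts, service))
                    break
        return sorted(crossings)   # the first entry is the root-cause candidate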
Triage during the storm
Once you have the candidate root, the rest of the playbook is normal incident response with one twist: silence the noise so the rest of the team can hear you.
Silence the symptoms. As soon as you have a root candidate, silence the symptom alerts: the ones triggered by the same root cause. In Alertmanager that's a silence on the cluster tag; in PagerDuty it's a maintenance window on the affected services. Set the silence for one hour; if the root is misdiagnosed, the alerts will come back when it expires.
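The silence itself can be posted through Alertmanager's API so you aren't clicking through matchers while the storm is loud. A minimal sketch, assuming Alertmanager on localhost:9093 and a cluster label shared by the symptom alerts; amtool or the UI does the same thing.

    from datetime import datetime, timedelta, timezone
    import requests

    ALERTMANAGER = "http://localhost:9093"   # assumption: local Alertmanager

    def silence_cluster(cluster, hours=1):
        """Silence everything carrying the cluster tag for one hour; returns the silence ID."""
        now = datetime.now(timezone.utc)
        body = {
            "matchers": [{"name": "cluster", "value": cluster, "isRegex": False}],
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(hours=hours)).isoformat(),
            "createdBy": "oncall",
            "comment": "storm: symptoms of known root cause",
        }
        resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=body)
        resp.raise_for_status()
        return resp.json()["silenceID"]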
Open one channel. The instinct in a storm is to spawn a thread per service. Don't. One incident channel, one incident commander, one stream. The information density of one channel beats six fragmented ones every time.
Communicate the cluster, not the alerts. Update the channel with the cluster summary every 10 minutes: “3 clusters, candidate root is the orders DB pool exhaustion, ETA on restart 5 min.” The team needs to know what you know, not the raw firehose.
After-storm cleanup
The storm ends; the silences expire; alerts resolve. The temptation now is to declare victory and go back to bed. The work isn't done; the post-storm cleanup is what determines whether the next storm is half as bad or twice as bad.
The 24-hour rule. Within 24 hours of the storm, write the post-mortem. Include: which alerts fired, which clusters reduced to which root, what the timeline looked like. The data fades fast; capture it while it’s warm.
The dependent-alerts cleanup. Identify the 50+ alerts that all fired because of the one root cause. They aren't wrong, they're redundant. Either suppress them while the root alert is firing (Alertmanager inhibition rules), or roll them up into a single "orders cluster down" alert that carries a count, as sketched below.
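One way to express the roll-up is a single alerting rule over Prometheus's built-in ALERTS meta-metric: one page carrying a count instead of fifty pages from the members. A sketch that writes the rule to a file; the cluster label, alert name, and threshold are placeholders.

    import textwrap

    # Roll-up alert over the ALERTS meta-metric. Assumes member alerts carry a
    # cluster="orders" label; the alert name and threshold are placeholders.
    ROLLUP_RULE = textwrap.dedent("""\
        groups:
          - name: rollups
            rules:
              - alert: OrdersClusterDown
                expr: count(ALERTS{alertstate="firing", cluster="orders"}) > 10
                for: 2m
                labels:
                  severity: page
                annotations:
                  summary: "orders cluster storm: {{ $value }} member alerts firing"
        """)

    with open("rollup_rules.yml", "w") as f:
        f.write(ROLLUP_RULE)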
The runbook update. The runbook for the root cause should now mention the symptom storm. The next on-call who sees the storm should land on a runbook that says “this is the orders DB pool storm; here’s the silence rule; here’s the restart command.” That’s the lesson, captured.
Preventing the next one
Storms are a symptom of a structural problem in the alert design. The fix isn’t bigger pagers; it’s fewer alerts that mean more.
Pattern 1: inhibition rules. When the database alert fires, suppress the dependent service alerts for the duration. Most alert systems support this: Prometheus Alertmanager calls them inhibition rules, PagerDuty calls it dependency suppression. Set them up for your top 5 shared dependencies and the next storm shrinks 10×.
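A sketch of what those rules can look like in Alertmanager, generated for a handful of hypothetical shared dependencies. The alert names, the severity=page convention for symptom alerts, and the cluster label are all assumptions; the printed fragment belongs under inhibit_rules: in alertmanager.yml.

    # pip install pyyaml
    import yaml

    # Hypothetical shared dependencies and the alert that signals each one is down.
    # Root-cause alerts should use a severity other than "page" so they never match
    # the target side of their own rule.
    SHARED_DEPS = {
        "orders-db":   "OrdersDBDown",
        "payments-db": "PaymentsDBDown",
        "redis-cache": "RedisDown",
        "auth":        "AuthDown",
        "kafka":       "KafkaDown",
    }

    inhibit_rules = [
        {
            "source_matchers": [f'alertname = "{alert}"'],   # the root-cause alert
            "target_matchers": ['severity = "page"'],        # the dependent symptom alerts
            "equal": ["cluster"],                             # only within the same cluster
        }
        for alert in SHARED_DEPS.values()
    ]

    print(yaml.safe_dump({"inhibit_rules": inhibit_rules}, sort_keys=False))

Paste the output under the existing inhibit_rules: section and reload Alertmanager.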
Pattern 2: SLO-based alerting. Alert at the service level (latency, error rate, saturation, traffic on the user-facing surface) instead of at the node level ("this disk is 80% full"). One SLO alert covers thousands of underlying causes; the storm collapses to a single page.
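A sketch of one such rule as a Prometheus alert on the user-facing error ratio. The metric, job name, and 1% threshold are placeholders; a real setup would derive the threshold and window from the SLO's error budget (burn-rate alerting).

    import textwrap

    # One SLO-style alert on the user-facing error ratio instead of hundreds of
    # per-node rules. Metric, job, and threshold are placeholders.
    SLO_RULE = textwrap.dedent("""\
        groups:
          - name: slo
            rules:
              - alert: CheckoutErrorRateHigh
                expr: |
                  sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
                    /
                  sum(rate(http_requests_total{job="checkout"}[5m]))
                  > 0.01
                for: 5m
                labels:
                  severity: page
                annotations:
                  summary: "checkout serving >1% errors for 5 minutes"
        """)

    print(SLO_RULE)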
Pattern 3: a storm-detection rule. Add a meta-rule that fires when alert volume exceeds 50 per minute. Send it to a separate channel; it's the signal for the on-call to switch to storm mode: mass-ack, find the cluster, find the root. The rule doesn't replace the underlying alerts; it primes the response.
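A sketch of the meta-rule as a Prometheus alert over the built-in ALERTS metric. Counting concurrently firing alerts is a simpler proxy for "50 new alerts per minute"; the threshold and severity label are placeholders, and the routing to a low-priority channel happens in your Alertmanager or PagerDuty routing config.

    import textwrap

    # Storm detector: fires when the count of concurrently firing alerts crosses a
    # threshold. ALERTS is Prometheus's built-in meta-metric.
    STORM_RULE = textwrap.dedent("""\
        groups:
          - name: meta
            rules:
              - alert: AlertStorm
                expr: count(ALERTS{alertstate="firing"}) > 50
                for: 1m
                labels:
                  severity: info      # route to the low-priority storm channel
                annotations:
                  summary: "alert storm: {{ $value }} alerts firing; switch to the storm playbook"
        """)

    with open("storm_rules.yml", "w") as f:
        f.write(STORM_RULE)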
What to do this week
Three moves. (1) Look at the last 90 days for any incident with >100 alerts in 5 minutes. Count them and pick the top 3. (2) For each, identify the inhibition rule that would have collapsed it; it's usually a few lines in alertmanager.yml. (3) Add a storm-detection meta-rule that fires at 50 alerts per minute and routes to a low-priority channel. The first time it fires, you'll be glad you have it; the on-call can switch to playbook mode without thrashing.