Alert Storm Detection
Many alerts firing at once is itself a signal. Detect storms explicitly.
What an alert storm is
An alert storm is hundreds of alerts in minutes, often from one root cause expressed across every service that depends on it. Storms are operationally dangerous: Alertmanager queues fill, PagerDuty rate-limits, Slack channels become unreadable, on-call gets DDoSed. The fix isn’t more alerts; it’s an explicit storm path.
- Hundreds of alerts in minutes. One root cause expressed across every dependent service.
- Operational dangers. Alertmanager queues fill, PagerDuty rate-limits, Slack unreadable, on-call DDoSed.
- Fix is the storm path. Not more alerts; an explicit storm-handling mode.
- Per-storm playbook. The storm response gets its own playbook, so on-call can respond correctly under pressure.
How to detect a storm
Three signals catch storms. Threshold on alerts-per-minute (above 50/min for 5 minutes is a storm condition for most mid-size teams); distinct alert names firing simultaneously (20+ within 2 minutes suggests a shared dependency); Alertmanager queue depth (sustained growth is a leading indicator).
- Alerts-per-minute threshold. Above 50/min for 5 minutes; the storm-condition signal.
- Distinct alert names. 20+ within a 2-minute window suggests a shared dependency.
- Queue depth growth. Sustained growth in Alertmanager queue is a leading indicator.
- Per-team threshold tuning. Calibrate thresholds to the team’s normal alert volume so detection fires on genuine storms rather than routine bursts.
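The three signals above can be sketched as small pure functions. This is a minimal illustration, not a reference implementation: the `Alert` type, the function names, and the reading of "above 50/min for 5 minutes" as "each of the last five one-minute buckets exceeds 50" are all assumptions, and the default numbers are the ones quoted above, meant to be tuned per team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical minimal alert record; real alerts carry many more labels.
    name: str
    fired_at: float  # seconds since epoch


def rate_storm(alerts, now, per_min=50, minutes=5):
    """True if each of the last `minutes` one-minute buckets has more than `per_min` alerts."""
    counts = [0] * minutes
    for a in alerts:
        age = now - a.fired_at
        if 0 <= age < minutes * 60:
            counts[int(age // 60)] += 1
    return all(c > per_min for c in counts)


def distinct_name_storm(alerts, now, min_names=20, window_s=120):
    """True if at least `min_names` distinct alert names fired within the window."""
    names = {a.name for a in alerts if 0 <= now - a.fired_at < window_s}
    return len(names) >= min_names


def queue_growing(depth_samples, min_samples=5):
    """True if the last `min_samples` queue-depth samples are strictly increasing."""
    recent = depth_samples[-min_samples:]
    return len(recent) == min_samples and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )


def detect_storm(alerts, depth_samples, now):
    # Queue growth is reported separately because it is a leading indicator,
    # not a storm condition on its own.
    return {
        "storm": rate_storm(alerts, now) or distinct_name_storm(alerts, now),
        "queue_warning": queue_growing(depth_samples),
    }
```

Keeping the thresholds as keyword arguments makes per-team tuning a configuration change rather than a code change.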
Storm-mode behavior
Storm-mode is opinionated. Auto-suppress lower severities (sev2 and sev3 muted for the duration); send a single “storm in progress” page to on-call with top alert names by count, top services, top regions; open the war room channel automatically with links to the storm dashboard, recent deploys, and recent infrastructure changes.
- Auto-suppress lower severities. Sev2 and sev3 muted for the storm duration.
- Single storm page. Top alert names, top services, top regions; the on-call gets one signal.
- Auto-open war room. Links to storm dashboard, recent deploys, recent infrastructure changes.
- Per-storm context links. The page carries the context on-call needs to orient within 30 seconds.
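As a rough sketch of the storm-mode behavior described above, the snippet below mutes sev2 and sev3 and builds the single page payload with top alert names, services, and regions. The dict-shaped alerts, the field names (`name`, `service`, `region`, `severity`), and the `links` argument are illustrative assumptions; delivering the page and opening the war room channel are left to whatever paging and chat tools the team already uses.

```python
from collections import Counter

# Severities muted for the duration of the storm, per the policy above.
SUPPRESSED_SEVERITIES = {"sev2", "sev3"}


def split_by_severity(alerts):
    """Partition alerts into (kept, muted) according to storm-mode suppression."""
    kept = [a for a in alerts if a["severity"] not in SUPPRESSED_SEVERITIES]
    muted = [a for a in alerts if a["severity"] in SUPPRESSED_SEVERITIES]
    return kept, muted


def storm_page(alerts, links, top_n=3):
    """Build the single 'storm in progress' page from all alerts firing in the storm."""

    def top(field):
        # most_common returns (value, count) pairs, highest count first.
        return Counter(a[field] for a in alerts).most_common(top_n)

    return {
        "title": "Storm in progress",
        "total_alerts": len(alerts),
        "top_alert_names": top("name"),
        "top_services": top("service"),
        "top_regions": top("region"),
        # Hypothetical link set: storm dashboard, recent deploys, infra changes.
        "links": links,
    }
```

The summary is computed over all alerts, including the muted ones, so the page still reflects the full shape of the storm even though only one notification goes out.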
After the storm
Post-storm hygiene closes the loop. Auto-resolve suppressed alerts that no longer match conditions and manually review any that persist; generate a storm report (total fires, distinct alert names, root cause, time to suppress, time to all-clear); add suppression rules to the catalog if the storm pattern is likely to recur.
- Auto-resolve suppressed. Alerts that no longer match conditions; manually review any that persist.
- Storm report. Total fires, distinct alert names, root cause, time to suppress, time to all-clear.
- Suppression rules added. If the storm pattern is likely to recur, the rule lands in the catalog.
- Per-storm postmortem. Each storm produces a documented postmortem whose findings feed prevention.
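The post-storm steps above lend themselves to a small data structure. The sketch below assumes alerts are identified by name and that the storm start, suppression, and all-clear timestamps are recorded somewhere; `StormReport`, `build_report`, and `triage_suppressed` are hypothetical names, and the root cause is supplied by a human rather than inferred.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StormReport:
    total_fires: int
    distinct_alert_names: int
    root_cause: str
    time_to_suppress: timedelta
    time_to_all_clear: timedelta


def build_report(fired_names, root_cause, started, suppressed_at, all_clear_at):
    """Summarize a storm from the list of alert names that fired (one entry per fire)."""
    return StormReport(
        total_fires=len(fired_names),
        distinct_alert_names=len(set(fired_names)),
        root_cause=root_cause,
        time_to_suppress=suppressed_at - started,
        time_to_all_clear=all_clear_at - started,
    )


def triage_suppressed(suppressed_names, still_firing):
    """Split suppressed alerts into (auto_resolve, manual_review).

    Alerts whose conditions no longer match are auto-resolved;
    anything still firing after the storm goes to a human.
    """
    auto_resolve = [n for n in suppressed_names if n not in still_firing]
    manual_review = [n for n in suppressed_names if n in still_firing]
    return auto_resolve, manual_review
```

Both durations are measured from the storm start here; a team that prefers to measure all-clear from the moment of suppression would change one subtraction.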
Build storm-mode before you need it
Storms happen 1-2 times a year for most teams. Building storm-mode is cheap, and not having it during the first storm is expensive. Skip it if alert volume is under 100/day, because the thresholds won’t trigger usefully at that volume. Otherwise, test storm-mode in a quarterly chaos drill: inject 200 synthetic alerts and confirm that suppression and reporting work.
- Build before need. 1-2 storms per year; build cost is low, no-build cost is high.
- Skip under 100/day. Thresholds won’t trigger usefully at low volume.
- Quarterly chaos drill. Inject 200 synthetic alerts; confirm suppression and reporting work.
- Per-drill verification. Each drill validates the storm path end to end, so the team stays prepared.
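A quarterly drill along these lines could be driven by a small harness. The sketch below is an assumption-heavy illustration: `synthetic_alerts` and `run_drill` are hypothetical helpers, `suppress` and `report` stand in for whatever storm-mode functions the team has built, and the report is assumed to expose a `total_fires` count. Actually injecting alerts into a live alerting pipeline is out of scope here.

```python
import itertools


def synthetic_alerts(n=200, names=25):
    """Generate n drill alerts spread across `names` alert names and three severities."""
    severities = itertools.cycle(["sev1", "sev2", "sev3"])
    return [
        {
            "name": f"DrillAlert{i % names}",
            "severity": next(severities),
            # Label drill traffic so it can be filtered out of real reports.
            "labels": {"drill": "true"},
        }
        for i in range(n)
    ]


def run_drill(suppress, report, n=200):
    """Feed synthetic alerts through storm-mode and check suppression and reporting.

    `suppress(alerts)` must return (kept, muted); `report(alerts)` must return
    a dict containing "total_fires". Both are supplied by the caller.
    """
    alerts = synthetic_alerts(n)
    kept, muted = suppress(alerts)
    summary = report(alerts)
    return {
        "injected": len(alerts),
        # Only sev1 should survive suppression, and no alert may be lost.
        "suppression_ok": all(a["severity"] == "sev1" for a in kept)
        and len(kept) + len(muted) == len(alerts),
        "report_ok": summary.get("total_fires") == len(alerts),
    }
```

A drill passes when both `suppression_ok` and `report_ok` are true; a failure on either is worth fixing before the next real storm.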