Alert Storm Detection

Many alerts firing at once is itself a signal. Detect storms and handle them explicitly.

What an alert storm is

An alert storm is hundreds of alerts in minutes, often from one root cause expressed across every service that depends on it. Storms are operationally dangerous: Alertmanager queues fill, PagerDuty rate-limits notifications, Slack channels become unreadable, and on-call is effectively DDoSed. The fix isn’t more alerts; it’s an explicit storm path.

How to detect a storm

Three signals catch storms: alert rate (more than 50 alerts per minute sustained for 5 minutes is a storm condition for most mid-size teams); the number of distinct alert names firing at once (20 or more within 2 minutes suggests a shared dependency); and Alertmanager queue depth (sustained growth is a leading indicator).
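
As a rough illustration, the first two signals can be sketched as a sliding-window check in Python. The class name, method names, and the reading of "for 5 minutes" as five consecutive one-minute buckets are assumptions made for this example, not an Alertmanager feature. The queue-depth signal is left out because it depends on how your deployment exposes that metric.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StormDetector:
    """Hypothetical sliding-window storm check using the thresholds from the text."""
    rate_threshold: int = 50        # alerts per minute
    rate_minutes: int = 5           # consecutive minutes that must exceed it
    distinct_threshold: int = 20    # distinct alert names
    distinct_window_s: int = 120    # within this many seconds
    _events: deque = field(default_factory=deque)  # (timestamp_s, alert_name)

    def observe(self, ts: float, name: str) -> None:
        """Record one firing alert; timestamps are assumed non-decreasing."""
        self._events.append((ts, name))
        horizon = ts - max(self.rate_minutes * 60, self.distinct_window_s)
        # Drop events too old to matter for either signal.
        while self._events and self._events[0][0] <= horizon:
            self._events.popleft()

    def rate_storm(self, now: float) -> bool:
        """True if every one of the last rate_minutes buckets exceeds the threshold."""
        counts = [0] * self.rate_minutes
        for ts, _ in self._events:
            age = now - ts
            if 0 <= age < self.rate_minutes * 60:
                counts[int(age // 60)] += 1
        return all(c > self.rate_threshold for c in counts)

    def distinct_storm(self, now: float) -> bool:
        """True if enough distinct alert names fired inside the short window."""
        names = {n for ts, n in self._events if 0 <= now - ts < self.distinct_window_s}
        return len(names) >= self.distinct_threshold

    def is_storm(self, now: float) -> bool:
        return self.rate_storm(now) or self.distinct_storm(now)
```

Feed each firing alert to observe() as it arrives and call is_storm() on a timer. Treating the two signals as an OR is a design choice for the sketch; a stricter team might require both.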

Storm-mode behavior

Storm-mode is opinionated. Auto-suppress lower severities (sev2 and sev3 muted for the duration); send a single “storm in progress” page to on-call with top alert names by count, top services, top regions; open the war room channel automatically with links to the storm dashboard, recent deploys, and recent infrastructure changes.
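
The suppression and single-page steps might look like the following Python sketch. The alert dictionary fields (name, severity, service, region) and both function names are hypothetical, and actually sending the page or opening the war room channel is left to whatever paging and chat integrations you already use.

```python
from collections import Counter

SUPPRESSED_SEVERITIES = {"sev2", "sev3"}  # muted for the duration of the storm


def split_for_storm(alerts):
    """Partition alerts into (kept, suppressed) by severity."""
    kept = [a for a in alerts if a["severity"] not in SUPPRESSED_SEVERITIES]
    suppressed = [a for a in alerts if a["severity"] in SUPPRESSED_SEVERITIES]
    return kept, suppressed


def build_storm_page(alerts, top_n=3):
    """Collapse the whole storm into one page payload for on-call."""
    return {
        "title": "Storm in progress",
        "total_alerts": len(alerts),
        "top_alert_names": Counter(a["name"] for a in alerts).most_common(top_n),
        "top_services": Counter(a["service"] for a in alerts).most_common(top_n),
        "top_regions": Counter(a["region"] for a in alerts).most_common(top_n),
    }
```

The page is built from all alerts, including the suppressed ones, so on-call still sees what was muted in aggregate.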

After the storm

Post-storm hygiene closes the loop. Auto-resolve suppressed alerts that no longer match conditions and manually review any that persist; generate a storm report (total fires, distinct alert names, root cause, time to suppress, time to all-clear); add suppression rules to the catalog if the storm pattern is likely to recur.
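
A minimal Python sketch of the triage and report steps follows. The StormReport fields mirror the list above; the function names and the idea of keying alerts by name are assumptions for illustration. Root cause is a free-text field because it has to come from human review, not from the alert stream.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StormReport:
    total_fires: int
    distinct_alert_names: int
    root_cause: str
    time_to_suppress: timedelta
    time_to_all_clear: timedelta


def triage_suppressed(suppressed_names, still_firing_names):
    """Split suppressed alerts into (auto_resolve, needs_review)."""
    still = set(still_firing_names)
    suppressed = set(suppressed_names)
    auto_resolve = sorted(n for n in suppressed if n not in still)
    needs_review = sorted(n for n in suppressed if n in still)
    return auto_resolve, needs_review


def build_report(fired_names, started_at: datetime, suppressed_at: datetime,
                 all_clear_at: datetime, root_cause: str = "TBD (fill in after review)"):
    """Summarize one storm; fired_names has one entry per alert fire."""
    return StormReport(
        total_fires=len(fired_names),
        distinct_alert_names=len(set(fired_names)),
        root_cause=root_cause,
        time_to_suppress=suppressed_at - started_at,
        time_to_all_clear=all_clear_at - started_at,
    )
```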

Build storm-mode before you need it

Storms happen once or twice a year for most teams. Building storm-mode is cheap; not having it during your first storm is expensive. Skip it if alert volume is under 100 per day, because the thresholds won’t trigger usefully. Otherwise, test storm-mode in a quarterly chaos drill by injecting 200 synthetic alerts and confirming that suppression and reporting work.
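
For the drill, a synthetic batch could be generated along these lines. This is a Python sketch under stated assumptions: the alert fields, the "drill" label, and both function names are hypothetical. Note that 200 alerts cannot exceed 50 per minute for 5 minutes (that needs more than 250), so this batch is shaped to trip the distinct-name signal instead: 25 names inside a 2-minute window. It only produces and sanity-checks the batch; injecting it and confirming suppression and reporting depends on your pipeline.

```python
import random


def synthetic_alerts(n=200, distinct_names=25, window_s=120, seed=0):
    """Generate n fake alerts spread over window_s seconds, sorted by time."""
    rng = random.Random(seed)  # seeded so each drill is reproducible
    alerts = []
    for i in range(n):
        alerts.append({
            "name": f"SyntheticDrill{i % distinct_names}",
            "severity": rng.choice(["sev1", "sev2", "sev3"]),
            "ts": rng.uniform(0, window_s),   # seconds since drill start
            "labels": {"drill": "true"},      # assumed marker to keep drills off real pagers
        })
    return sorted(alerts, key=lambda a: a["ts"])


def batch_should_trigger(alerts, distinct_threshold=20, window_s=120):
    """Check the batch meets the distinct-name storm condition from the text."""
    names = {a["name"] for a in alerts}
    span = alerts[-1]["ts"] - alerts[0]["ts"]
    return len(names) >= distinct_threshold and span <= window_s
```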