By Samson Tanimawo, PhD. Published Feb 6, 2026.

Alert Storm Detection

Many alerts firing at once is itself a signal. Detect storms.

What an alert storm is

An alert storm is hundreds of alerts firing within minutes, often from one root cause expressed across every service that depends on it.

Storms are operationally dangerous: Alertmanager queues fill, PagerDuty starts rate-limiting, Slack channels become unreadable, and on-call gets DDoSed.

The fix isn't more alerts; it's an explicit storm path.

Set a threshold on alerts per minute. A rate above 50/min sustained for 5 minutes is a storm condition for most mid-size teams.
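The rate threshold can be sketched as a per-minute counter. This is a minimal illustration, not a production detector: the class name `RateStormDetector` and its methods are hypothetical, and "above 50/min for 5 minutes" is interpreted here as every one of the last five complete minutes exceeding the threshold.

```python
class RateStormDetector:
    """Flags a storm when every recent minute exceeds a per-minute alert threshold."""

    def __init__(self, per_minute_threshold=50, sustained_minutes=5):
        self.per_minute_threshold = per_minute_threshold
        self.sustained_minutes = sustained_minutes
        self.buckets = {}  # minute index -> number of alerts seen in that minute

    def record(self, timestamp_s):
        """Count one alert that fired at timestamp_s (seconds since any fixed epoch)."""
        minute = int(timestamp_s // 60)
        self.buckets[minute] = self.buckets.get(minute, 0) + 1

    def is_storm(self, now_s):
        """True if each of the last N complete minutes was above the threshold."""
        current = int(now_s // 60)
        oldest = current - self.sustained_minutes
        # Forget buckets that can no longer affect the decision.
        for minute in [m for m in self.buckets if m < oldest]:
            del self.buckets[minute]
        return all(
            self.buckets.get(m, 0) > self.per_minute_threshold
            for m in range(oldest, current)
        )
```

Call `record()` for each incoming alert and poll `is_storm()` once a minute. Tune both constructor arguments to your own baseline volume.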

Track distinct alert names firing simultaneously. 20+ distinct names within a 2-minute window suggests a shared dependency.
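One way to track distinct names is a sliding window over recent alert events. The sketch below assumes alerts are recorded in timestamp order; `DistinctNameTracker` and its method names are illustrative, not part of any alerting tool.

```python
from collections import deque


class DistinctNameTracker:
    """Counts distinct alert names seen within a sliding time window."""

    def __init__(self, window_s=120, min_distinct=20):
        self.window_s = window_s
        self.min_distinct = min_distinct
        self.events = deque()  # (timestamp_s, alert_name), oldest first

    def record(self, timestamp_s, alert_name):
        self.events.append((timestamp_s, alert_name))

    def distinct_names(self, now_s):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now_s - self.window_s:
            self.events.popleft()
        return {name for _, name in self.events}

    def suggests_shared_dependency(self, now_s):
        return len(self.distinct_names(now_s)) >= self.min_distinct
```

The set returned by `distinct_names()` is also a useful input for the on-call summary, since it shows which alert families are involved.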

Watch Alertmanager's queue depth. Sustained queue growth is a leading indicator before storm conditions become visible.
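"Sustained growth" needs a concrete definition before it can drive anything. A minimal sketch, assuming you already sample queue depth at a fixed interval by whatever means your deployment exposes (the article does not name a specific metric, and neither does this code): treat the queue as growing when each of the most recent samples is strictly larger than the one before it.

```python
def queue_growing(depth_samples, min_consecutive_increases=5):
    """Return True if the most recent samples are strictly increasing.

    depth_samples: queue depths, oldest first, sampled at a fixed interval.
    min_consecutive_increases: how many back-to-back increases count as
    "sustained" (an assumed default; tune it to your sampling interval).
    """
    needed = min_consecutive_increases + 1
    if len(depth_samples) < needed:
        return False
    recent = depth_samples[-needed:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A strict-increase check is deliberately conservative. A noisier queue may need a slope or moving-average test instead.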

Auto-suppress lower severities. A storm condition mutes sev2 and sev3 for the duration.
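The suppression rule itself is small. The sketch below assumes each alert is a dict with a `severity` key using the labels `sev1`, `sev2`, and `sev3`; the `route` function and its return values are hypothetical names chosen for illustration.

```python
# Severities muted while a storm is active, per the rule above.
SUPPRESSED_DURING_STORM = {"sev2", "sev3"}


def route(alert, storm_active):
    """Return 'suppress' or 'deliver' for an alert dict with a 'severity' key."""
    if storm_active and alert["severity"] in SUPPRESSED_DURING_STORM:
        return "suppress"
    return "deliver"
```

Suppressed alerts should still be recorded somewhere so they can be reconciled once the storm clears.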

Send a single 'storm in progress' page to on-call. Include the top alert names by count, the top services, and the top regions.
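Building that summary is a counting exercise. A rough sketch, assuming each alert is a dict with `name`, `service`, and `region` keys (your label names may differ) and leaving delivery to whatever paging integration you use:

```python
from collections import Counter


def build_storm_page(alerts, top_n=3):
    """Summarise firing alerts into the text of a single storm page."""

    def top(key):
        counts = Counter(alert[key] for alert in alerts)
        return ", ".join(f"{value} ({n})" for value, n in counts.most_common(top_n))

    return "\n".join([
        f"Storm in progress: {len(alerts)} alerts firing",
        f"Top alert names: {top('name')}",
        f"Top services: {top('service')}",
        f"Top regions: {top('region')}",
    ])
```

Keeping the page to a few lines matters: it has to be readable on a phone lock screen.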

Open the war room channel automatically. Include links to the storm dashboard, recent deploys, and recent infrastructure changes.

Auto-resolve suppressed alerts whose conditions no longer hold. Manually review any that persist.

Generate a storm report: total fires, distinct alert names, root cause, time to suppress, time to all-clear.
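Most of the report fields fall out of data the storm path already has. The sketch below is one possible shape, with hypothetical names (`StormReport`, `build_report`); the root cause is passed in as free text because it comes from the humans who investigated, not from the tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StormReport:
    total_fires: int
    distinct_alert_names: int
    root_cause: str
    time_to_suppress: timedelta
    time_to_all_clear: timedelta


def build_report(fired_names, root_cause, started_at, suppressed_at, all_clear_at):
    """Assemble a report from the names of fired alerts and three timestamps."""
    return StormReport(
        total_fires=len(fired_names),
        distinct_alert_names=len(set(fired_names)),
        root_cause=root_cause,
        time_to_suppress=suppressed_at - started_at,
        time_to_all_clear=all_clear_at - started_at,
    )
```

Both durations are measured from the moment the storm condition was first detected, so they can be compared across storms.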

Add suppression rules to the catalog if the storm pattern is likely to recur.

Storms happen once or twice a year for most teams. The cost of building storm-mode is low; the cost of not having it during the first storm is high.

Skip storm-mode if your total alert volume is under 100/day. The thresholds won't trigger usefully.

Test storm-mode in a chaos drill quarterly. Inject 200 synthetic alerts and confirm the suppression and reporting work.
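A drill harness can be as simple as a generator of synthetic alerts plus a tally of what your storm path did with them. This is a sketch under assumptions: `handle_alert` stands in for your own entry point and is expected to return a decision string, and the alert fields (`name`, `severity`, `timestamp_s`, `synthetic`) are illustrative.

```python
import itertools


def synthetic_alerts(count=200, distinct_names=25, start_s=0, span_s=60):
    """Yield synthetic alerts spread evenly over span_s seconds."""
    severities = itertools.cycle(["sev1", "sev2", "sev3"])
    for i in range(count):
        yield {
            "name": f"SyntheticAlert{i % distinct_names}",
            "severity": next(severities),
            "timestamp_s": start_s + (i * span_s) / count,
            "synthetic": True,  # lets downstream tooling discard drill traffic
        }


def run_drill(handle_alert, count=200):
    """Feed synthetic alerts to handle_alert and tally its decisions."""
    outcomes = {}
    for alert in synthetic_alerts(count):
        decision = handle_alert(alert)
        outcomes[decision] = outcomes.get(decision, 0) + 1
    return outcomes
```

Compare the tallies against what you expect: with storm-mode active, every sev2 and sev3 alert should land in the suppressed bucket, and the storm report should show all 200 fires.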