The Alert Rate Limit Pattern
Some alerts can flood the pager. Rate limit them.
Why rate limiting alerts matters
A failing dependency can fire 10,000 alerts in an hour, each identical, flooding the pager and burying the real signal. Rate limiting caps the number of alert events per group per hour at the alerting layer. It is not deduplication: dedup collapses identical alerts, while rate limiting also caps non-identical alerts from the same source.
- 10,000 alerts/hour scenario. A failing dependency can produce them; pager floods and real signal is lost.
- Cap per group per hour. The alerting-layer rate limit; bounds total page volume.
- Not deduplication. Dedup collapses identical alerts; rate limiting caps non-identical ones from the same source.
- Per-storm protection. Rate limiting is the storm-survival mechanism; it keeps the pager usable so responders can still act when noise spikes.
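The difference between deduplication and rate limiting can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular alerting tool implements it; the `GroupRateLimiter` class, the `dedup` helper, the cap of 20, and the `payments-db` group name are all hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600  # one hour


class GroupRateLimiter:
    """Allow at most `cap` alert events per group in a sliding one-hour window."""

    def __init__(self, cap):
        self.cap = cap
        self._sent = defaultdict(deque)  # group -> timestamps of allowed events

    def allow(self, group, now):
        sent = self._sent[group]
        # Forget events that have aged out of the window.
        while sent and now - sent[0] >= WINDOW_SECONDS:
            sent.popleft()
        if len(sent) >= self.cap:
            return False
        sent.append(now)
        return True


def dedup(alerts):
    """Collapse exactly identical alerts, preserving order."""
    return list(dict.fromkeys(alerts))


# 10,000 alerts from one failing dependency, each with a different message,
# spread evenly across one hour (hypothetical data).
storm = [("payments-db", f"timeout on request {i}") for i in range(10_000)]

after_dedup = dedup(storm)  # nothing collapses: every message differs
limiter = GroupRateLimiter(cap=20)
paged = [a for i, a in enumerate(storm) if limiter.allow(a[0], now=i * 0.36)]

print(len(after_dedup), len(paged))  # 10000 20
```

Dedup leaves the storm untouched because no two messages match, while the per-group cap bounds what reaches the pager regardless of message content.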
How to implement
Each tool has its own mechanism. Alertmanager: group_interval and repeat_interval per route, with repeat_interval 4h non-critical and 1h critical. PagerDuty: event rules with rate-limit actions, drop or downgrade above N/hour. OpsGenie: notification policies with frequency caps.
- Alertmanager. group_interval and repeat_interval per route; 4h non-critical, 1h critical.
- PagerDuty. Event rules with rate-limit actions; drop or downgrade events above N/hour.
- OpsGenie. Notification policies with frequency caps; the same primitive in different UI.
- Per-tool default. Commit each tool’s default to config so rate limiting is applied consistently.
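For the Alertmanager case, the per-route intervals might look roughly like the structure below, written as a Python dict that mirrors the fields of an Alertmanager route (`receiver`, `group_by`, `group_wait`, `group_interval`, `repeat_interval`, `matchers`, `routes`). The receiver names, the `service` and `region` labels, the `severity="critical"` matcher, and the `effective_repeat_interval` helper are assumptions for illustration; the real configuration is a YAML file, so treat this as a sketch to transcribe rather than a drop-in config.

```python
import json

# Receiver names and label names are hypothetical.
route = {
    "receiver": "team-default",
    "group_by": ["alertname", "service", "region"],
    "group_wait": "30s",
    "group_interval": "5m",
    "repeat_interval": "4h",  # non-critical: re-notify every 4 hours
    "routes": [
        {
            "matchers": ['severity="critical"'],
            "receiver": "team-pager",
            "repeat_interval": "1h",  # critical: re-notify every hour
        }
    ],
}


def effective_repeat_interval(route, labels):
    """Return the repeat_interval that applies to an alert with these labels.

    Simplified: handles only one level of child routes and only equality
    matchers of the form name="value".
    """
    for child in route.get("routes", []):
        pairs = (m.split("=", 1) for m in child.get("matchers", []))
        if all(labels.get(k) == v.strip('"') for k, v in pairs):
            return child.get("repeat_interval", route["repeat_interval"])
    return route["repeat_interval"]


print(effective_repeat_interval(route, {"severity": "critical"}))  # 1h
print(effective_repeat_interval(route, {"severity": "warning"}))   # 4h
print(json.dumps(route, indent=2))
```

Keeping the intervals in a structure like this, under version control, is one way to make the per-tool default reviewable.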
What to rate-limit by
The right dimensions are: alert name plus service plus region (not alert name alone, because that hides regional outages); owner team (a team should not receive more than 10 distinct pages per hour, with escalation above that); and integration source (if a webhook starts spamming, cap the source before it floods downstream).
- By alert name + service + region. Not by alert name alone; alone hides regional outages.
- By owner team. No more than 10 distinct pages per hour; above that, escalate to the team lead.
- By integration source. If a webhook starts spamming, cap the source before it floods downstream.
- Per-dimension cap policy. Document a cap for each dimension so it is applied consistently.
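The first two dimensions can be sketched as a grouping key plus a per-team cap. This is an illustrative model only: the alert field names (`alertname`, `service`, `region`, `team`) and the `route_pages` function are assumptions, and the input is taken to be one hour of alerts.

```python
from collections import defaultdict

TEAM_CAP = 10  # distinct pages per team per hour, as in the text


def rate_limit_key(alert):
    """Group by alert name + service + region, not by alert name alone."""
    return (alert["alertname"], alert["service"], alert["region"])


def route_pages(alerts):
    """Split one hour of alerts into pages and escalations, per owner team."""
    seen = defaultdict(set)  # team -> distinct keys already paged
    pages, escalations = [], []
    for alert in alerts:
        key = rate_limit_key(alert)
        team_keys = seen[alert["team"]]
        if key in team_keys:
            continue  # this group already paged the team this hour
        if len(team_keys) < TEAM_CAP:
            team_keys.add(key)
            pages.append(alert)
        else:
            escalations.append(alert)  # above the cap: escalate instead
    return pages, escalations


# The same alert firing in 12 regions (hypothetical data).
alerts = [
    {"alertname": "HighLatency", "service": "checkout",
     "region": f"region-{i}", "team": "payments"}
    for i in range(12)
]

pages, escalations = route_pages(alerts)
print(len({a["alertname"] for a in alerts}))      # 1 group if keyed by name alone
print(len({rate_limit_key(a) for a in alerts}))   # 12 groups with the full key
print(len(pages), len(escalations))               # 10 2
```

Keyed by alert name alone, the 12 regional failures would look like a single group; the full key keeps each region visible while the team cap still bounds the page count.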
When rate limiting hides outages
Rate limiting that silently drops alerts is dangerous: the outage is happening, but the page count no longer reflects it. Always log dropped alerts; add a meta-alert if drops exceed N/hour; and prefer suppression-with-marker (“5,000 similar alerts suppressed”) over hard drops, because the suppressed count is itself the signal.
- Silent drops are dangerous. The outage is happening but the page count is misleading; this is the worst-case rate-limit failure.
- Always log drops. Log every dropped alert so the drop is visible during investigation.
- Meta-alert on drops. If drops exceed N/hour, a meta-alert fires, so the rate limiter itself is monitored.
- Suppression-with-marker. “5,000 similar alerts suppressed”; the suppressed count is the signal, not lost data.
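A minimal sketch of suppression-with-marker, assuming alerts arrive as `(group, message)` tuples for a single hour. The `apply_cap` function, the logger name, the `meta` group, and the threshold values are hypothetical.

```python
import logging
from collections import Counter

log = logging.getLogger("alert_rate_limit")


def apply_cap(alerts, cap, meta_threshold):
    """Cap alerts per group and return the notifications to send.

    Suppressed alerts are logged, summarized in one marker per group,
    and trigger a meta-alert when the total exceeds `meta_threshold`.
    """
    sent = Counter()
    suppressed = Counter()
    out = []
    for group, message in alerts:
        if sent[group] < cap:
            sent[group] += 1
            out.append((group, message))
        else:
            suppressed[group] += 1
            # Never drop silently: every suppressed alert leaves a log line.
            log.warning("suppressed alert group=%s message=%s", group, message)
    # One marker per group carries the suppressed count forward.
    for group, count in suppressed.items():
        out.append((group, f"{count:,} similar alerts suppressed"))
    total = sum(suppressed.values())
    if total > meta_threshold:
        out.append(("meta", f"rate limiter suppressed {total:,} alerts this hour"))
    return out
```

For example, 5,005 alerts in one group with `cap=5` would yield five pages, one marker reading "5,000 similar alerts suppressed", and, with `meta_threshold=1000`, one meta-alert.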
Default settings
The defaults are opinionated. A repeat_interval of 1h for sev1, 4h for sev2, and off for sev3 stops oscillation without losing visibility. The per-team cap is 10 distinct alerts per hour, with escalate-or-batch above that. Test by simulating an outage with 1,000 alerts in 5 minutes against staging to confirm that only the bounded number reaches paging.
- repeat_interval defaults. 1h sev1, 4h sev2, off sev3; stops oscillation without losing visibility.
- Per-team cap. 10 distinct alerts per hour; above that, escalate or batch.
- Test by simulation. Fire 1,000 alerts in 5 minutes against staging; confirm that only the bounded number reaches paging.
- Per-deploy verification. Verify rate-limit changes by simulation before promoting them, so rollouts stay safe.
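The shape of the simulation test can be sketched offline before pointing it at staging. The version below replays synthetic timestamps through a simple fixed-hourly-window counter. It is a stand-in for the real alerting pipeline, and `simulate_storm` and `CAP_PER_HOUR` are hypothetical names.

```python
from collections import defaultdict

CAP_PER_HOUR = 10  # per-team cap of 10 distinct alerts per hour


def simulate_storm(n_alerts, duration_s, cap):
    """Replay n_alerts evenly spaced over duration_s seconds.

    Uses a fixed hourly window and returns (paged, suppressed).
    """
    window_counts = defaultdict(int)  # hour index -> alerts paged in that hour
    paged = suppressed = 0
    for i in range(n_alerts):
        t = i * duration_s / n_alerts  # synthetic timestamp in seconds
        hour = int(t // 3600)
        if window_counts[hour] < cap:
            window_counts[hour] += 1
            paged += 1
        else:
            suppressed += 1
    return paged, suppressed


# 1,000 alerts in 5 minutes, as in the test described above.
paged, suppressed = simulate_storm(n_alerts=1000, duration_s=300, cap=CAP_PER_HOUR)
print(paged, suppressed)  # 10 990
```

A real run would fire the alerts at the staging alerting stack and compare the number of pages received against the same bound.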