The Alert Rate Limit Pattern
Some alerts can flood. Rate limit them.
Why rate limiting alerts matters
A failing dependency can fire 10,000 alerts in an hour, most of them near-identical. The pager floods and the real signal is lost.
Rate limit at the alerting layer: cap the number of alert events per group per hour.
This is not deduplication. Dedup collapses identical alerts; rate limiting caps even non-identical ones from the same source.
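To make the distinction concrete, here is a minimal Python sketch (the source key, cap, and label names are illustrative, not from any particular tool): dedup keys on the full label set, so alerts that differ in any label all pass; the rate limiter keys on the source and caps them regardless.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 3600   # one-hour window
MAX_PER_SOURCE = 100    # illustrative cap per source key per window

seen_fingerprints = set()               # dedup state (never expires; sketch only)
events_per_source = defaultdict(list)   # rate-limit state: timestamps per key

def dedup_passes(alert):
    """Dedup: drop only exact repeats (same full label set)."""
    fingerprint = tuple(sorted(alert["labels"].items()))
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True

def rate_limit_passes(alert, now=None):
    """Rate limit: cap all events from one source, identical or not."""
    now = time.time() if now is None else now
    key = alert["labels"]["service"]   # the source key
    recent = [t for t in events_per_source[key] if now - t < WINDOW_SECONDS]
    events_per_source[key] = recent
    if len(recent) >= MAX_PER_SOURCE:
        return False                   # over the cap: suppress
    recent.append(now)
    return True
```

With this split, 10,000 alerts that differ only in an instance label all pass dedup, but only the first 100 per hour pass the rate limit.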
How to implement
Alertmanager: group_interval and repeat_interval per route. group_interval caps how often newly arriving alerts extend an existing group's notifications; repeat_interval caps how often an unresolved group re-notifies. Set repeat_interval to 1h for critical, 4h for non-critical.
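A minimal routing sketch along those lines; the receiver name and interval values are illustrative:

```yaml
route:
  receiver: default-pager
  group_by: [alertname, service, region]  # see "What to rate-limit by" below
  group_wait: 30s       # delay before the first notification for a new group
  group_interval: 5m    # batch newly arriving alerts into an existing group
  repeat_interval: 4h   # non-critical default: re-notify at most every 4h
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 1h   # critical: re-notify hourly while unresolved
```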
PagerDuty: event rules with rate-limit actions. Drop or downgrade events above N/hour.
OpsGenie: notification policies with frequency caps.
What to rate-limit by
By alert name + service + region. Not by alert name alone; that hides regional outages.
By owner team. A team should not receive more than 10 distinct pages per hour. Above that, escalate to the team lead.
By integration source. If a webhook starts spamming, cap the source before it floods downstream. A keyed limiter covering these cases is sketched below.
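A sketch of those keys in Python, in the same style as the limiter above (label names, thresholds, and the suppress/escalate actions are illustrative): the rate key is (alertname, service, region), and a separate counter bounds distinct pages per owner team.

```python
import time
from collections import defaultdict

WINDOW = 3600
MAX_EVENTS_PER_KEY = 100     # per (alertname, service, region)
MAX_DISTINCT_PER_TEAM = 10   # distinct alert names paged per team per hour

events = defaultdict(list)      # (alertname, service, region) -> timestamps
team_pages = defaultdict(dict)  # team -> {alertname: first_seen_timestamp}

def rate_key(alert):
    labels = alert["labels"]
    # Region is part of the key so one region's flood cannot hide another's outage.
    return (labels["alertname"], labels["service"], labels["region"])

def admit(alert, now=None):
    now = time.time() if now is None else now

    # Per-source cap: bound events per (alertname, service, region).
    key = rate_key(alert)
    recent = [t for t in events[key] if now - t < WINDOW]
    events[key] = recent
    if len(recent) >= MAX_EVENTS_PER_KEY:
        return "suppress"
    recent.append(now)

    # Per-team cap: bound *distinct* alert names paging one team.
    team = alert["labels"]["team"]
    pages = {a: t for a, t in team_pages[team].items() if now - t < WINDOW}
    pages.setdefault(alert["labels"]["alertname"], now)
    team_pages[team] = pages
    if len(pages) > MAX_DISTINCT_PER_TEAM:
        return "escalate"   # hand off to the team lead instead of the rotation
    return "page"
```

An integration-source cap works the same way, with the webhook's identifier as the key.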
When rate limiting hides outages
A rate limit that drops alerts silently is dangerous: the outage is still happening, but the page count no longer reflects it.
Always log every dropped alert, and add a meta-alert that fires when drops exceed N/hour.
Prefer suppression with a marker ("5,000 similar alerts suppressed") over hard drops.
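A sketch of that accounting, assuming a limiter like the ones above; fire_meta_alert and the threshold are hypothetical stand-ins for the real paging path:

```python
import logging
from collections import Counter

log = logging.getLogger("alert-limiter")

META_ALERT_DROPS_PER_HOUR = 500   # illustrative N; hourly window reset omitted
suppressed = Counter()            # group key -> suppressed count this window

def on_suppressed(key):
    suppressed[key] += 1
    # Never drop silently: every suppressed alert leaves a log line.
    log.info("suppressed alert key=%s count=%d", key, suppressed[key])
    if sum(suppressed.values()) > META_ALERT_DROPS_PER_HOUR:
        fire_meta_alert(f"{sum(suppressed.values())} alerts suppressed this hour")

def suppression_marker(key):
    # Attached to the next delivered notification instead of hard-dropping.
    return f"{suppressed[key]:,} similar alerts suppressed"

def fire_meta_alert(summary):
    # Hypothetical hook into the normal paging path.
    log.warning("META-ALERT: %s", summary)
```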
Default settings
repeat_interval = 1h for sev1, 4h for sev2, off for sev3. This caps repeat-page churn without losing visibility.
Per-team cap = 10 distinct alerts per hour. Above that, escalate to the team lead or batch the overflow into a digest.
Test by simulating an outage. Fire 1,000 alerts in 5 minutes against staging and confirm only the bounded number reaches paging.
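A sketch of that test against a staging Alertmanager, using its v2 HTTP API; the staging URL and label values are illustrative:

```python
import time
import requests

STAGING = "http://alertmanager.staging:9093"   # hypothetical staging endpoint

def fire(n=1000, duration_s=300):
    """POST n synthetic alerts over duration_s seconds."""
    for i in range(n):
        alert = [{
            "labels": {
                "alertname": "DependencyDown",
                "service": "checkout",
                "region": "us-east-1",
                "instance": f"pod-{i}",  # varies per alert, so dedup alone won't collapse them
                "severity": "critical",
            },
            "annotations": {"summary": "synthetic flood test"},
        }]
        requests.post(f"{STAGING}/api/v2/alerts", json=alert, timeout=5)
        time.sleep(duration_s / n)

if __name__ == "__main__":
    fire()
```

Point the staging route at a webhook sink, count what arrives, and compare against the configured caps.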