The Alert Rate Limit Pattern

A single fault can generate thousands of alerts. Cap how many of them reach a human.

Why rate limiting alerts matters

A failing dependency can fire 10,000 identical alerts in an hour, flooding the pager and burying the real signal. Rate limiting caps the number of alert events per group per hour at the alerting layer. It is not deduplication: deduplication collapses identical alerts, while rate limiting also caps non-identical alerts from the same source.
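The difference between the two can be shown with a small Python sketch. The alert shape (a dict with `source`, `message`, and `ts` in seconds) and both function names are assumptions made for illustration, not any tool's format.

```python
from collections import defaultdict

def dedup(alerts):
    """Collapse alerts whose (source, message) pair is identical."""
    seen = set()
    kept = []
    for alert in alerts:
        key = (alert["source"], alert["message"])
        if key not in seen:
            seen.add(key)
            kept.append(alert)
    return kept

def rate_limit(alerts, cap_per_hour):
    """Keep at most cap_per_hour alerts per source per clock hour."""
    counts = defaultdict(int)
    kept = []
    for alert in alerts:
        bucket = (alert["source"], int(alert["ts"] // 3600))
        if counts[bucket] < cap_per_hour:
            counts[bucket] += 1
            kept.append(alert)
    return kept

# 100 alerts from one source, each with a slightly different message.
alerts = [
    {"source": "db", "message": f"timeout on shard {i}", "ts": i}
    for i in range(100)
]
deduped = dedup(alerts)          # nothing is identical, so nothing collapses
limited = rate_limit(alerts, 5)  # capped by source regardless of message
```

Here `deduped` still holds all 100 alerts because no two messages match, while `limited` holds 5.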

How to implement

Each tool has its own mechanism. In Alertmanager, set group_interval and repeat_interval per route, for example a repeat_interval of 4h for non-critical routes and 1h for critical ones. In PagerDuty, use event rules with rate-limit actions that drop or downgrade events above N per hour. In OpsGenie, use notification policies with frequency caps.
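The "drop or downgrade above N per hour" behaviour can be sketched independently of any tool. The class below is a hypothetical sliding-window limiter in Python; it is not the API of Alertmanager, PagerDuty, or OpsGenie, and the names `HourlyLimiter` and `decide` are invented for this example.

```python
from collections import deque

class HourlyLimiter:
    """Sliding-window limiter: page up to `limit` events per window,
    downgrade the rest. Hypothetical sketch, not a vendor API."""

    def __init__(self, limit, window_s=3600):
        self.limit = limit
        self.window_s = window_s
        self.times = deque()  # timestamps of events still inside the window

    def decide(self, ts):
        # Forget events that have aged out of the window.
        while self.times and ts - self.times[0] >= self.window_s:
            self.times.popleft()
        # Count every event, including downgraded ones, so paging
        # resumes only once the incoming rate has actually fallen.
        self.times.append(ts)
        return "page" if len(self.times) <= self.limit else "downgrade"

limiter = HourlyLimiter(limit=3)
decisions = [limiter.decide(ts) for ts in (0, 1, 2, 3)]
later = limiter.decide(3602)
```

The first three events page and the fourth is downgraded. An event more than an hour after the burst pages again because the earlier timestamps have left the window.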

What to rate-limit by

Three dimensions work well. The first is alert name plus service plus region; alert name alone is too coarse because it hides regional outages. The second is owner team: a team should not receive more than 10 distinct pages per hour, and anything above that should escalate. The third is integration source: if a webhook starts spamming, cap the source before it floods downstream.
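A minimal Python sketch of the first two dimensions follows. The field names (`alertname`, `service`, `region`, `team`) and the function names are assumptions for illustration; the cap of 10 comes from the text above.

```python
from collections import defaultdict

TEAM_CAP = 10  # distinct pages per team per hour

def group_key(alert):
    """Group by alert name, service, and region, not alert name alone."""
    return (alert["alertname"], alert["service"], alert["region"])

def route_hour(alerts, team_cap=TEAM_CAP):
    """Split one hour of alerts into pages and escalations per team."""
    paged = defaultdict(set)      # team -> group keys already paged
    escalated = defaultdict(set)  # team -> group keys already escalated
    pages, escalations = [], []
    for alert in alerts:
        team, key = alert["team"], group_key(alert)
        if key in paged[team] or key in escalated[team]:
            continue  # this group was already handled in this hour
        if len(paged[team]) < team_cap:
            paged[team].add(key)
            pages.append(alert)
        else:
            escalated[team].add(key)
            escalations.append(alert)
    return pages, escalations

# The same alert name firing in 12 regions yields 12 distinct groups.
alerts = [
    {"alertname": "HighLatency", "service": "checkout",
     "region": f"region-{i}", "team": "payments"}
    for i in range(12)
]
alerts.append(dict(alerts[0]))  # a repeat of an existing group
pages, escalations = route_hour(alerts)
```

With this input the team receives 10 pages, the remaining 2 groups escalate, and the repeated alert is skipped. Grouping by alert name alone would have folded all 12 regions into one group.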

When rate limiting hides outages

Rate limiting that silently drops alerts is dangerous: the outage is happening, but the page count no longer reflects its size. Always log dropped alerts, and add a meta-alert that fires if drops exceed N per hour. Prefer suppression with a marker (“5,000 similar alerts suppressed”) over hard drops, because the suppressed count is itself the signal.
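The three safeguards can be sketched together in Python. The function name, the `(group, message)` tuple shape, and the wording of the meta-alert are assumptions made for this example.

```python
import logging
from collections import Counter

log = logging.getLogger("alert_limiter")

def limit_with_marker(alerts, cap, meta_threshold):
    """alerts: list of (group, message) tuples from one hour.

    Forwards up to `cap` alerts per group, logs every suppressed alert,
    appends one marker per suppressed group, and appends a meta-alert
    when total suppressions exceed `meta_threshold`.
    """
    sent = Counter()
    suppressed = Counter()
    out = []
    for group, message in alerts:
        if sent[group] < cap:
            sent[group] += 1
            out.append((group, message))
        else:
            suppressed[group] += 1
            log.warning("suppressed alert group=%s message=%s", group, message)
    # The marker carries the suppressed count forward instead of hiding it.
    for group, count in suppressed.items():
        out.append((group, f"{count:,} similar alerts suppressed"))
    total = sum(suppressed.values())
    if total > meta_threshold:
        out.append(("meta", f"{total:,} alerts suppressed in the last hour"))
    return out

burst = [("db", f"timeout {i}") for i in range(103)]
result = limit_with_marker(burst, cap=3, meta_threshold=50)
```

For this burst, `result` contains the first 3 alerts, one marker reading "100 similar alerts suppressed", and one meta-alert, so the on-call engineer sees 5 items and still knows the true volume.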

Default settings

These defaults are opinionated. Set repeat_interval to 1h for sev1 and 4h for sev2, and turn repeats off for sev3; this stops oscillation without losing visibility. Cap each team at 10 distinct alerts per hour, and escalate or batch anything above the cap. Test by simulating an outage of 1,000 alerts in 5 minutes against staging, and confirm that only the bounded number reaches paging.
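The shape of that test can be sketched in Python against an in-memory stand-in. A real run would target the staging pipeline; `simulate_outage` and `pages_reaching_oncall` are hypothetical helpers, and the 50 distinct alert names and the `payments` team are arbitrary choices for this example.

```python
import random

def simulate_outage(n_alerts=1000, window_s=300, team="payments", seed=0):
    """Generate n_alerts spread over window_s seconds, 50 distinct names."""
    rng = random.Random(seed)
    alerts = [
        {"ts": rng.uniform(0, window_s),
         "team": team,
         "alertname": f"Check{i % 50}Failed"}
        for i in range(n_alerts)
    ]
    return sorted(alerts, key=lambda a: a["ts"])

def pages_reaching_oncall(alerts, team_cap=10):
    """Stand-in for the pipeline: count distinct pages per team under the cap."""
    paged = {}
    for alert in alerts:
        seen = paged.setdefault(alert["team"], set())
        if alert["alertname"] in seen or len(seen) >= team_cap:
            continue
        seen.add(alert["alertname"])
    return {team: len(names) for team, names in paged.items()}

outage = simulate_outage()
page_counts = pages_reaching_oncall(outage)
```

The check to make is that `page_counts` never exceeds the cap for any team, here 10 pages from 1,000 alerts. The same assertion applies when the stand-in is replaced by a count of pages actually delivered in staging.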