Alert Throttling During Active Incidents

During active incidents, related alerts are noise.

Why throttle during incidents

During a major incident, related alerts fire continuously. The on-call engineer already knows about the problem, so additional pages add noise without information. Throttling caps the page rate per service per incident; a typical rule is at most 1 page per 15 minutes per service while the incident is open. The first page wakes the human, and the next 50 are paperwork.
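The typical rule above can be sketched as a small in-memory throttle. This is a minimal illustration, not any vendor's implementation: the class name PageThrottle, the method should_page, and the incident and service identifiers are all hypothetical, and a real deployment would keep this state in the paging tool rather than in process memory.

```python
from datetime import datetime, timedelta


class PageThrottle:
    """Allow at most one page per (incident, service) within a fixed window."""

    def __init__(self, window: timedelta = timedelta(minutes=15)):
        self.window = window
        # Hypothetical in-memory state: time of the last page that was let through.
        self._last_page: dict[tuple[str, str], datetime] = {}

    def should_page(self, incident_id: str, service: str, now: datetime) -> bool:
        key = (incident_id, service)
        last = self._last_page.get(key)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress this page
        self._last_page[key] = now
        return True


throttle = PageThrottle()
start = datetime(2024, 1, 1, 3, 0)
# Alerts for the same service at minutes 0, 1, 5, 14, 15 and 16 of the incident.
decisions = [
    throttle.should_page("INC-1", "checkout", start + timedelta(minutes=m))
    for m in (0, 1, 5, 14, 15, 16)
]
print(decisions)
```

Only the alerts at minute 0 and minute 15 page in this run. Because the key includes the service, a different service in the same incident still gets its own first page.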

How to throttle

The mechanisms are tool-specific. PagerDuty event orchestration supports rate limiting per service; Opsgenie has alert deduplication policies; Nova AI Ops groups by incident. Throttle by service first and only then by team, because cross-team throttling can hide unrelated incidents. Always log throttled alerts so the on-call can see “these 14 alerts fired but were throttled during incident X”.
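The logging requirement can be sketched independently of any tool. The example below assumes a hypothetical ThrottleLog class that records every suppressed alert against its incident and produces the kind of summary line quoted above; the alert names are invented for illustration.

```python
from collections import defaultdict


class ThrottleLog:
    """Keep every suppressed alert so nothing disappears silently."""

    def __init__(self):
        # incident id -> names of alerts that fired but did not page
        self._suppressed: dict[str, list[str]] = defaultdict(list)

    def record(self, incident_id: str, alert_name: str) -> None:
        self._suppressed[incident_id].append(alert_name)

    def suppressed(self, incident_id: str) -> list[str]:
        return list(self._suppressed.get(incident_id, []))

    def summary(self, incident_id: str) -> str:
        count = len(self._suppressed.get(incident_id, []))
        noun, verb = ("alert", "was") if count == 1 else ("alerts", "were")
        return f"{count} {noun} fired but {verb} throttled during incident {incident_id}"


log = ThrottleLog()
for i in range(14):
    log.record("X", f"checkout-latency-{i}")  # hypothetical alert names
print(log.summary("X"))
```

Keeping the individual alert names, not just the count, is what makes the later postmortem audit possible.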

When not to throttle

Throttling has hard exceptions. Three kinds of alert should never be throttled: security alerts (a breach during an outage is more dangerous, not less); data-loss alerts (replication failures and backup misses must always page); and alerts from new, unrelated services (the throttle is scoped to the incident, not applied as a blanket). The exceptions exist because the throttle’s purpose is noise reduction, not signal suppression.
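One way to encode these exceptions is a guard that runs before any rate limit is consulted. The sketch below is an assumption about how alerts might be modelled: the Alert and Incident types, the category labels "security" and "data-loss", and the function may_throttle are hypothetical names, not fields from a specific paging tool.

```python
from dataclasses import dataclass, field

# Assumed category labels; real alerts would need to be tagged accordingly.
NEVER_THROTTLE = frozenset({"security", "data-loss"})


@dataclass(frozen=True)
class Alert:
    service: str
    category: str


@dataclass
class Incident:
    incident_id: str
    services: set[str] = field(default_factory=set)  # services in the incident's scope


def may_throttle(alert: Alert, incident: Incident) -> bool:
    """Return True only if the alert is eligible for throttling at all."""
    if alert.category in NEVER_THROTTLE:
        return False  # security and data-loss alerts always page
    if alert.service not in incident.services:
        return False  # outside the incident's scope: treat as a new signal
    return True


incident = Incident("INC-1", {"checkout", "payments"})
print(may_throttle(Alert("checkout", "latency"), incident))
print(may_throttle(Alert("checkout", "security"), incident))
print(may_throttle(Alert("search", "latency"), incident))
```

Putting the check in front of the rate limit means an exempt alert can never be suppressed by a misconfigured window.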

After the incident

Post-incident hygiene matters. Resume normal alerting within 5 minutes of resolution, because a 30-minute throttle window after resolution hides recurring issues. Audit throttled alerts in the postmortem, because sometimes a throttled signal pointed to a separate, missed incident. If throttling kicks in more than once a week, alert volume during incidents is too high, so tune the underlying signals rather than the throttle window.
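The resume rule can be expressed as a single predicate. This is a sketch under stated assumptions: the function name throttle_active and the convention that an open incident has resolved_at set to None are hypothetical, and the 5-minute value is the upper bound given above.

```python
from datetime import datetime, timedelta
from typing import Optional

RESUME_AFTER = timedelta(minutes=5)  # upper bound on post-resolution throttling


def throttle_active(
    resolved_at: Optional[datetime],
    now: datetime,
    resume_after: timedelta = RESUME_AFTER,
) -> bool:
    """Throttle while the incident is open and for a short grace period after."""
    if resolved_at is None:
        return True  # incident still open
    return now - resolved_at < resume_after


resolved = datetime(2024, 1, 1, 4, 0)
print(throttle_active(None, resolved))                             # incident open
print(throttle_active(resolved, resolved + timedelta(minutes=4)))  # inside grace period
print(throttle_active(resolved, resolved + timedelta(minutes=5)))  # normal alerting resumes
```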

Apply this quarter

Keep the rollout targeted. Configure rate limiting on your top 5 services, capped at 1 page per 15 minutes per service while an incident is active. Test it in a game day: trigger a fake incident, fire 20 alerts, and confirm that only one pages. Review the configuration monthly, because throttling that suppresses real incidents is worse than no throttling.
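The game-day check can be rehearsed offline before touching the real paging tool. The harness below is illustrative only: run_game_day and make_stub_throttle are hypothetical names, the stub stands in for whatever rate limit is actually configured, and the 10-second spacing between synthetic alerts is an assumption.

```python
from datetime import datetime, timedelta
from typing import Callable

PagePolicy = Callable[[str, datetime], bool]


def run_game_day(
    should_page: PagePolicy,
    service: str,
    start: datetime,
    alerts: int = 20,
    spacing: timedelta = timedelta(seconds=10),
) -> int:
    """Fire synthetic alerts for one service and count how many would page."""
    return sum(should_page(service, start + i * spacing) for i in range(alerts))


def make_stub_throttle(window: timedelta = timedelta(minutes=15)) -> PagePolicy:
    """Stand-in for the real rate limit: one page per service per window."""
    last_page: dict[str, datetime] = {}

    def should_page(service: str, now: datetime) -> bool:
        previous = last_page.get(service)
        if previous is not None and now - previous < window:
            return False
        last_page[service] = now
        return True

    return should_page


pages = run_game_day(make_stub_throttle(), "checkout", datetime(2024, 1, 1, 10, 0))
assert pages == 1, f"expected exactly one page, got {pages}"
```

In a real game day the callable would wrap the paging tool's test or sandbox path; the assertion at the end is the pass criterion described above.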