Alert Throttling During Active Incidents
During active incidents, related alerts are noise.
Why throttle during incidents
During a major incident, related alerts fire continuously. The on-call engineer already knows about the problem, so additional pages add noise without information. Throttling caps the page rate per service per incident; a typical rule is at most 1 page per 15 minutes per service while the incident is open. The first page wakes the human, and the next 50 are paperwork.
- Continuous alert firing. A major incident produces a continuous stream of related alerts, and the on-call already knows.
- Cap per service per incident. The typical rule is 1 page per 15 minutes per service while the incident is open.
- First page wakes, rest paperwork. The cap focuses attention on the original signal.
- Per-service throttle config. Document the throttle settings for each service so responders can check them when throttling fires.
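The cap described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `PageThrottle` class, its `should_page` method, and the incident and service names are hypothetical, and the 15-minute window is simply the typical rule quoted above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PageThrottle:
    """Hypothetical helper: at most one page per (incident, service) per window."""

    window: timedelta = timedelta(minutes=15)
    # Maps (incident_id, service) to the time of the last page that went out.
    _last_page: dict = field(default_factory=dict)

    def should_page(self, incident_id: str, service: str, now: datetime) -> bool:
        key = (incident_id, service)
        last = self._last_page.get(key)
        if last is not None and now - last < self.window:
            return False  # still inside the window: suppress this page
        self._last_page[key] = now  # window elapsed or first alert: page
        return True


throttle = PageThrottle()
start = datetime(2024, 1, 1, 3, 0)
first = throttle.should_page("INC-1", "checkout", start)
repeat = throttle.should_page("INC-1", "checkout", start + timedelta(minutes=5))
other_service = throttle.should_page("INC-1", "payments", start + timedelta(minutes=5))
```

Keying on the incident and service together means a second service involved in the same incident still gets its own first page.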
How to throttle
The mechanisms are tool-specific. PagerDuty event orchestration supports rate limiting per service; Opsgenie has alert deduplication policies; Nova AI Ops groups by incident. Scope the throttle by service first and by team second, because cross-team throttling can hide unrelated incidents. Always log throttled alerts so the on-call can see “these 14 alerts fired but were throttled during incident X”.
- PagerDuty event orchestration. Rate limiting per service; the canonical mechanism.
- Opsgenie deduplication policies. A similar primitive in a different UI.
- Service-first, team-second. Cross-team throttling can hide unrelated incidents.
- Always log throttled. “These 14 alerts fired but were throttled during incident X”; visibility is mandatory.
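One way to keep throttled alerts visible is to record each suppressed alert against its incident and produce a summary on demand. The sketch below is tool-agnostic; the `ThrottleLog` class and its method names are assumptions made for illustration and do not correspond to a PagerDuty or Opsgenie API.

```python
from collections import Counter, defaultdict


class ThrottleLog:
    """Hypothetical record of every alert that was suppressed, keyed by incident."""

    def __init__(self) -> None:
        self._suppressed = defaultdict(list)  # incident_id -> [(service, alert_name)]

    def record(self, incident_id: str, service: str, alert_name: str) -> None:
        self._suppressed[incident_id].append((service, alert_name))

    def per_service(self, incident_id: str) -> Counter:
        # Count suppressed alerts per service, since the throttle is service-first.
        entries = self._suppressed.get(incident_id, [])
        return Counter(service for service, _ in entries)

    def summary(self, incident_id: str) -> str:
        count = len(self._suppressed.get(incident_id, []))
        return f"These {count} alerts fired but were throttled during incident {incident_id}"


log = ThrottleLog()
for i in range(14):
    log.record("X", "checkout" if i < 10 else "payments", f"alert-{i}")
print(log.summary("X"))
print(log.per_service("X"))
```

The per-service breakdown makes it easier to spot a service that was throttled but does not obviously belong to the incident.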
When not to throttle
Throttling has hard exceptions. Security alerts always page, because a breach during an outage is more dangerous, not less. Data-loss alerts, such as replication failures and backup misses, must always page. Alerts from new, unrelated services also page, because the throttle is scoped to the incident and is not a blanket rule. These exceptions exist because the throttle’s purpose is noise reduction, not signal suppression.
- Security alerts. A breach during an outage is more dangerous, not less; never throttle.
- Data-loss alerts. Replication failures, backup misses; must always page.
- New unrelated services. Throttle scoped to the incident; the unrelated service still pages.
- Document each exception class. A written list of exception classes keeps the throttle scoped correctly.
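The exceptions can be expressed as a check that runs before any throttle decision. The following is a minimal sketch: the `AlertClass` enum, the `NEVER_THROTTLE` set, and the `is_throttle_eligible` function are hypothetical names, and how an alert gets its class is left to whatever tagging the alerting pipeline already has.

```python
from enum import Enum


class AlertClass(Enum):
    SECURITY = "security"
    DATA_LOSS = "data_loss"  # e.g. replication failures, backup misses
    OTHER = "other"


# Classes that must always page, regardless of any open incident.
NEVER_THROTTLE = {AlertClass.SECURITY, AlertClass.DATA_LOSS}


def is_throttle_eligible(alert_class: AlertClass, service: str, incident_services: set) -> bool:
    """Return True only if the alert may be rate limited for this incident."""
    if alert_class in NEVER_THROTTLE:
        return False
    # The throttle is scoped to the incident: unrelated services still page.
    return service in incident_services


incident_services = {"checkout", "payments"}
print(is_throttle_eligible(AlertClass.OTHER, "checkout", incident_services))
print(is_throttle_eligible(AlertClass.SECURITY, "checkout", incident_services))
print(is_throttle_eligible(AlertClass.OTHER, "search", incident_services))
```

Putting the exception check first means a misconfigured rate limit cannot suppress a security or data-loss page.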
After the incident
Post-incident hygiene matters. Resume normal alerting within 5 minutes of resolution; a 30-minute throttle window after resolution hides recurring issues. Audit throttled alerts in the postmortem, because a throttled signal sometimes points to a separate incident that was missed. If throttling fires more than weekly, alert volume during incidents is too high, so tune the underlying signals rather than the throttle window.
- Resume within 5 minutes. A 30-minute throttle window after resolution hides recurring issues.
- Audit throttled in postmortem. A throttled signal sometimes pointed to a separate incident that was missed.
- Weekly throttling means tune signals. Alert volume during incidents is too high; fix the underlying alerts.
- Per-postmortem throttle review. Review throttled alerts in every postmortem so the tuning is continuous.
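The two timing rules above lend themselves to small checks. This sketch assumes the incident record exposes a resolution timestamp; `throttle_active` and `fires_more_than_weekly` are hypothetical helpers, and reading "more than weekly" as more than one throttled incident in the past 7 days is an interpretation, not something the rule itself specifies.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

RESUME_GRACE = timedelta(minutes=5)  # resume normal alerting within 5 minutes


def throttle_active(now: datetime, resolved_at: Optional[datetime],
                    grace: timedelta = RESUME_GRACE) -> bool:
    """The throttle applies while the incident is open and briefly after it resolves."""
    if resolved_at is None:
        return True  # incident still open
    return now - resolved_at < grace


def fires_more_than_weekly(throttle_times: Iterable[datetime], now: datetime) -> bool:
    """True if throttling kicked in more than once in the past 7 days (assumed reading)."""
    week_ago = now - timedelta(days=7)
    return sum(1 for t in throttle_times if t > week_ago) > 1


resolved = datetime(2024, 1, 1, 4, 0)
print(throttle_active(resolved + timedelta(minutes=3), resolved))
print(throttle_active(resolved + timedelta(minutes=6), resolved))
```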
Apply this quarter
The application is targeted. Configure rate limiting on your top 5 services, capped at 1 page per 15 minutes per service while an incident is active. Test it in a game day: trigger a fake incident, fire 20 alerts, and confirm that only one pages. Review monthly, because throttling that suppresses real incidents is worse than no throttling.
- Top 5 services first. Configure rate limiting; cap at 1 page per 15 minutes during incident.
- Test in game day. Fire 20 alerts at a fake incident; confirm only one pages.
- Monthly review. Throttling that suppresses real incidents is worse than no throttling.
- Per-month adherence audit. Measure whether live behavior matches the throttle config each month so the config stays accurate.
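The game-day check can be rehearsed offline before running it against a real paging tool. The simulation below is a sketch under stated assumptions: `simulate_game_day` is a hypothetical function, the 30-second spacing between alerts is made up for the example, and the throttle logic is a stand-in for whatever the paging tool actually enforces.

```python
from datetime import datetime, timedelta


def simulate_game_day(alert_count: int = 20,
                      spacing: timedelta = timedelta(seconds=30),
                      window: timedelta = timedelta(minutes=15)):
    """Fire alerts at one service in a fake incident; return (paged, throttled) times."""
    start = datetime(2024, 1, 1, 12, 0)
    last_page = None
    paged, throttled = [], []
    for i in range(alert_count):
        fired_at = start + i * spacing
        if last_page is None or fired_at - last_page >= window:
            paged.append(fired_at)  # first alert, or the window has elapsed
            last_page = fired_at
        else:
            throttled.append(fired_at)  # suppressed, but kept for the log
    return paged, throttled


paged, throttled = simulate_game_day()
assert len(paged) == 1, "only the first alert should page"
assert len(throttled) == 19, "the remaining alerts should be throttled, not lost"
```

If the 20 alerts are spread over more than 15 minutes, a second page is expected, so keep the burst shorter than the throttle window when checking the "only one pages" condition.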