Alert Throttling During Active Incidents

During active incidents, related alerts are noise.

Why throttle during incidents

During a major incident, related alerts fire continuously. The on-call already knows about the problem, so additional pages add noise without adding information. Throttling caps the page rate per service per incident; a typical rule is a maximum of 1 page per 15 minutes per service while the incident is open. The reasoning is that the first page wakes the human and the next 50 are paperwork.

Continuous alert firing. A major incident produces a continuous stream of related alerts, and the on-call already knows.
Cap per service per incident. The typical rule is 1 page per 15 minutes per service while the incident is open.
First page wakes, rest is paperwork. The cap keeps attention on the original signal.
Per-service throttle config. Document the throttle settings for each service so the on-call can check them when throttling fires.
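The typical rule above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular paging tool: the class name `PageThrottle`, the incident ID `INC-1`, and the service names are hypothetical, and timestamps are plain seconds so the example stays deterministic.

```python
from dataclasses import dataclass, field

# The "1 page per 15 minutes" rule from the text, in seconds.
THROTTLE_WINDOW_SECONDS = 15 * 60


@dataclass
class PageThrottle:
    """Caps pages per (incident, service) pair while an incident is open."""

    window_seconds: int = THROTTLE_WINDOW_SECONDS
    # Maps (incident_id, service) -> timestamp of the last page that was sent.
    _last_page: dict = field(default_factory=dict)

    def should_page(self, incident_id: str, service: str, now: float) -> bool:
        key = (incident_id, service)
        last = self._last_page.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the window: suppress this page
        self._last_page[key] = now
        return True


throttle = PageThrottle()
# Four alerts for the same service at t = 0s, 60s, 899s and 900s.
decisions = [throttle.should_page("INC-1", "checkout", t) for t in (0, 60, 899, 900)]
print(decisions)  # [True, False, False, True]
```

Keying on the (incident, service) pair means a second service involved in the same incident gets its own first page instead of being swallowed by the first service's window.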

How to throttle

The mechanisms are tool-specific. PagerDuty event orchestration supports rate limiting per service; Opsgenie has alert deduplication policies; Nova AI Ops groups by incident. Throttle by service first and by team second, because cross-team throttling can hide unrelated incidents. Always log throttled alerts so the on-call can see “these 14 alerts fired but were throttled during incident X”.

PagerDuty event orchestration. Rate limiting per service; the canonical mechanism.
Opsgenie deduplication policies. The same primitive in a different UI.
Service-first, team-second. Cross-team throttling can hide unrelated incidents.
Always log throttled alerts. “These 14 alerts fired but were throttled during incident X”; visibility is mandatory.
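The service-first scoping and the logging requirement can be illustrated independently of any vendor. The sketch below is tool-agnostic and does not use the PagerDuty or Opsgenie APIs; `Alert`, `throttle_key`, `ThrottledAlertLog`, and the sample names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    team: str


def throttle_key(alert: Alert, incident_id: str) -> tuple:
    # Service-first: the key uses the service, not the team, so an alert
    # from another service owned by the same team is not throttled with it.
    return (incident_id, alert.service)


class ThrottledAlertLog:
    """Keeps every suppressed alert so the on-call can review it later."""

    def __init__(self) -> None:
        self._by_incident = defaultdict(list)

    def record(self, incident_id: str, alert: Alert) -> None:
        self._by_incident[incident_id].append(alert)

    def summary(self, incident_id: str) -> str:
        count = len(self._by_incident.get(incident_id, []))
        return f"{count} alerts fired but were throttled during incident {incident_id}"


log = ThrottledAlertLog()
for i in range(14):
    log.record("INC-42", Alert(name=f"latency-{i}", service="checkout", team="payments"))

summary_line = log.summary("INC-42")
print(summary_line)  # 14 alerts fired but were throttled during incident INC-42
```

A team-level key would have placed a `ledger` alert from the same `payments` team under the `checkout` throttle, which is the failure mode the service-first rule is meant to avoid.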

When not to throttle

Throttling has hard exceptions. Security alerts are never throttled, because a breach during an outage is more dangerous, not less. Data-loss alerts such as replication failures and backup misses must always page. Alerts from new services unrelated to the incident still page, because the throttle is scoped to the incident and is not a blanket rule. The exceptions exist because the throttle’s purpose is noise reduction, not signal suppression.

Security alerts. A breach during an outage is more dangerous, not less; never throttle these.
Data-loss alerts. Replication failures and backup misses must always page.
New unrelated services. The throttle is scoped to the incident, so an unrelated service that starts alerting still pages.
Per-exception class documented. Document each exception class so the throttle stays correctly scoped.
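One way to encode these exceptions is an eligibility check that runs before any rate limiting. The following is a sketch under assumptions: the class labels `security` and `data_loss`, the function name `is_throttle_eligible`, and the service names are hypothetical and would need to match however your alerts are actually tagged.

```python
# Alert classes that must always page, regardless of any open incident.
NEVER_THROTTLE_CLASSES = frozenset({"security", "data_loss"})


def is_throttle_eligible(alert_class: str, alert_service: str, incident_services: set) -> bool:
    """Return True only if the alert may be rate limited for this incident."""
    if alert_class in NEVER_THROTTLE_CLASSES:
        return False  # hard exception: always page
    # The throttle is scoped to the incident, so a service that is not
    # part of the incident is never eligible.
    return alert_service in incident_services


incident_services = {"checkout", "cart"}

print(is_throttle_eligible("latency", "checkout", incident_services))    # True
print(is_throttle_eligible("security", "checkout", incident_services))   # False
print(is_throttle_eligible("data_loss", "cart", incident_services))      # False
print(is_throttle_eligible("latency", "search", incident_services))      # False
```

Checking eligibility first keeps the exceptions in one documented place, separate from the rate-limit window itself.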

After the incident

Post-incident hygiene matters. Resume normal alerting within 5 minutes of resolution; a 30-minute throttle window after resolution hides recurring issues. Audit throttled alerts in the postmortem, because a throttled signal sometimes points to a separate incident that was missed. If throttling fires more than once a week, alert volume during incidents is too high, so tune the underlying signals rather than the throttle window.

Resume within 5 minutes. A 30-minute throttle window after resolution hides recurring issues.
Audit throttled alerts in the postmortem. A throttled signal sometimes points to a separate incident that was missed.
Weekly throttling means tune signals. Alert volume during incidents is too high; fix the underlying alerts.
Per-postmortem throttle review. Review throttled alerts in every postmortem to keep tuning continuous.
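The two time-based rules here can be expressed as small checks. This is a sketch with assumptions labeled: it reads "within 5 minutes" as an upper bound on how long the throttle may outlive resolution, and it reads "more than weekly" as more than one throttle activation per week averaged over a lookback period. The function names and the 4-week lookback are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Upper bound on how long throttling may continue after resolution.
RESUME_GRACE = timedelta(minutes=5)


def throttle_active(now: datetime, resolved_at: Optional[datetime]) -> bool:
    """Throttle while the incident is open, and for at most 5 minutes after."""
    if resolved_at is None:
        return True  # incident still open
    return now - resolved_at < RESUME_GRACE


def fires_more_than_weekly(fire_times: list, now: datetime, lookback_weeks: int = 4) -> bool:
    """True if throttling activated more than once per week, on average."""
    cutoff = now - timedelta(weeks=lookback_weeks)
    recent = [t for t in fire_times if t >= cutoff]
    return len(recent) > lookback_weeks


resolved = datetime(2024, 1, 1, 12, 0)
print(throttle_active(resolved + timedelta(minutes=4), resolved))  # True
print(throttle_active(resolved + timedelta(minutes=5), resolved))  # False

now = datetime(2024, 1, 29)
activations = [now - timedelta(days=d) for d in (1, 3, 8, 10, 20)]
print(fires_more_than_weekly(activations, now))  # True: 5 activations in 4 weeks
```

When the second check returns True, the text's advice is to tune the underlying alerts rather than to lengthen the throttle window.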

Apply this quarter

Start with a narrow scope. Configure rate limiting on your top 5 services, capped at 1 page per 15 minutes per service while an incident is active. Test it in a game day: trigger a fake incident, fire 20 alerts, and confirm that only one pages. Review the configuration monthly, because throttling that suppresses real incidents is worse than no throttling.

Top 5 services first. Configure rate limiting, capped at 1 page per 15 minutes per service during an incident.
Test in a game day. Fire 20 alerts at a fake incident and confirm that only one pages.
Monthly review. Throttling that suppresses real incidents is worse than no throttling.
Per-month adherence audit. Measure adherence to the throttle config each month to keep it accurate.
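The game-day check can be rehearsed as a small simulation before running it against real tooling. In this sketch, `make_throttle` is a stand-in for whatever rate limiting your paging tool applies, and `run_game_day`, the incident ID `GAMEDAY-1`, and the 10-second alert spacing are hypothetical choices for illustration.

```python
def make_throttle(window_seconds: int = 15 * 60):
    """Stand-in for the paging tool: 1 page per window per (incident, service)."""
    last_page = {}

    def should_page(incident_id: str, service: str, now: float) -> bool:
        key = (incident_id, service)
        if key in last_page and now - last_page[key] < window_seconds:
            return False
        last_page[key] = now
        return True

    return should_page


def run_game_day(should_page, alert_count: int = 20, spacing_seconds: int = 10):
    """Fire alerts at a fake incident and count pages versus throttled alerts."""
    paged = 0
    for i in range(alert_count):
        if should_page("GAMEDAY-1", "checkout", i * spacing_seconds):
            paged += 1
    return paged, alert_count - paged


paged, throttled = run_game_day(make_throttle())
print(paged, throttled)  # 1 19
assert paged == 1, "game day failed: more than one alert paged"
```

In a real game day the same assertion applies, but the count of pages comes from the paging tool's incident log rather than from a simulated throttle.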