Time-Based Alert Throttling: Catching the 3am Spam Without Losing Signal
Throttling is not silencing; it means ‘tell me once.’ The patterns below stop an alert from firing 50 times for the same condition.
Throttle vs silence
Silence: never fire. Throttle: fire once, suppress duplicates for N minutes.
Throttle is what you want for ‘the alert is real but on-call already knows.’
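The difference can be sketched in a few lines of Python. This is a minimal illustration, not how any particular alerting tool implements it; the `Throttle` class, the `silenced` function, the alert key, and the 10-minute window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Throttle:
    """Fire once per alert key, then suppress duplicates for `window_s` seconds."""

    window_s: float
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate inside the window: suppress
        self._last_sent[key] = now  # first sighting, or window expired: notify
        return True


def silenced(key: str, now: float) -> bool:
    """A silence, by contrast, never notifies."""
    return False


throttle = Throttle(window_s=600)  # 10-minute window, an illustrative value
decisions = [throttle.should_notify("disk_full:db1", t) for t in (0, 30, 599, 600, 660)]
print(decisions)  # [True, False, False, True, False]
```

The throttle still lets the alert through once per window, so on-call hears about the condition; the silence drops it entirely.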
Four throttling patterns
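1. group_wait. Wait N seconds for related alerts to arrive before sending the first notification for a group.
2. group_interval. Once a group has notified, send updates about it at most every N minutes.
3. repeat_interval. Re-send the same alert if it is still firing after N hours.
4. Inhibition. Suppress an alert while its parent alert is active (in Prometheus Alertmanager this is configured through inhibit_rules).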
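To see how the three timers interact, here is a rough simulation of one alert group. It is a simplified model written for illustration, assuming alerts never resolve and that a group is re-checked every group_interval; it is not Alertmanager's actual implementation, and the function names and sample values are hypothetical.

```python
def notification_times(arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Times (in seconds) at which one alert group would notify, up to `horizon`."""
    arrivals = sorted(arrivals)
    if not arrivals:
        return []
    sent = []
    last_sent = None
    seen = 0  # number of alerts included in the last notification
    t = arrivals[0] + group_wait  # pattern 1: wait before the first send
    while t <= horizon:
        current = sum(1 for a in arrivals if a <= t)
        changed = current != seen  # pattern 2: new alerts joined the group
        due = last_sent is not None and t - last_sent >= repeat_interval  # pattern 3
        if last_sent is None or changed or due:
            sent.append(t)
            last_sent = t
            seen = current
        t += group_interval
    return sent


def is_inhibited(alert_name, active_alerts, parents):
    """Pattern 4: suppress an alert while its parent alert is active."""
    parent = parents.get(alert_name)
    return parent is not None and parent in active_alerts


times = notification_times(
    arrivals=[0, 5, 20, 400],
    group_wait=30,
    group_interval=300,
    repeat_interval=3600,
    horizon=7200,
)
print(times)  # [30, 630, 4230]
```

In this run the first three alerts are batched into one notification at 30s, the late arrival triggers an update at 630s, and the only other page is the repeat an hour after that.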
Per-tier throttle settings
- SEV1: group_wait 30s, repeat_interval 4h.
- SEV2: group_wait 1m, repeat_interval 12h.
- SEV3: group_wait 5m, repeat_interval 24h.
The pattern: more severe = shorter wait, more frequent re-page; less severe = longer wait, rarer re-page.
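One way to keep the tiers honest is to hold them in a single table and check the pattern mechanically. The sketch below uses the values from this section; the dictionary layout and the `follows_severity_pattern` helper are assumptions made for illustration.

```python
from datetime import timedelta

# Values copied from the per-tier settings in this section; the layout is an assumption.
TIER_THROTTLE = {
    "SEV1": {"group_wait": timedelta(seconds=30), "repeat_interval": timedelta(hours=4)},
    "SEV2": {"group_wait": timedelta(minutes=1), "repeat_interval": timedelta(hours=12)},
    "SEV3": {"group_wait": timedelta(minutes=5), "repeat_interval": timedelta(hours=24)},
}


def follows_severity_pattern(tiers, order=("SEV1", "SEV2", "SEV3")):
    """True if both settings strictly increase as severity decreases."""
    for setting in ("group_wait", "repeat_interval"):
        values = [tiers[name][setting] for name in order]
        if any(a >= b for a, b in zip(values, values[1:])):
            return False
    return True


print(follows_severity_pattern(TIER_THROTTLE))  # True
```

A check like this could run in CI so that a later edit cannot quietly give SEV3 a shorter repeat than SEV1.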
Distinguishing repeat-signal from noise
Repeat-signal is fine: ‘still degraded’ every 4 hours wakes the right person at the right cadence.
Noise-on-loop is when the alert flaps every 30 seconds. The fix is to tune the alert itself (add a for: clause so the condition must hold for a while before the alert fires), not just to throttle it.
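The effect of a for: clause can be approximated in Python. This is a sketch of the idea only (the condition must hold continuously for `for_s` seconds before the alert fires), not Prometheus's rule evaluator; the function name and the sample data are hypothetical.

```python
def firing_times(samples, for_s):
    """Return timestamps at which the alert is firing.

    `samples` is a time-ordered list of (timestamp_s, condition_is_bad) pairs.
    """
    pending_since = None
    firing = []
    for ts, bad in samples:
        if not bad:
            pending_since = None  # condition cleared: reset the pending timer
            continue
        if pending_since is None:
            pending_since = ts  # condition just became bad: start pending
        if ts - pending_since >= for_s:
            firing.append(ts)  # bad for long enough: firing
    return firing


# A check that flips every 30 seconds versus one that stays bad.
flapping = [(t, (t // 30) % 2 == 0) for t in range(0, 330, 30)]
sustained = [(t, True) for t in range(0, 330, 30)]

print(firing_times(flapping, for_s=120))   # []
print(firing_times(sustained, for_s=120))  # [120, 150, 180, 210, 240, 270, 300]
```

The flapping series never stays bad long enough to fire, so there is nothing left to throttle, while the sustained problem still fires after two minutes.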
Antipatterns
- Repeat_interval too short. Same alert wakes you 6 times.
- Repeat_interval too long. Forgotten incidents.
- Throttling without grouping. 50 individual alerts for one cluster outage.
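The last antipattern is easy to see with a toy example. The sketch below groups alerts by a chosen set of labels; the label names, the cluster name, and the `group_alerts` helper are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# One cluster outage reported by 50 nodes.
alerts = [
    {"alertname": "NodeDown", "cluster": "prod-east", "instance": f"node-{i}"}
    for i in range(50)
]

ungrouped = group_alerts(alerts, ["alertname", "cluster", "instance"])
grouped = group_alerts(alerts, ["alertname", "cluster"])
print(len(ungrouped), len(grouped))  # 50 1
```

Throttling each of the 50 per-instance alerts still produces 50 pages; grouping by cluster first leaves a single notification to throttle.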
What to do this week
Three moves. (1) Apply the throttle settings for its tier to your noisiest alert. (2) Measure pages-per-shift before and after for one week. (3) Schedule the quarterly review so the discipline survives team turnover.
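For step (2), a small script over exported page timestamps is usually enough. The sketch below assumes 12-hour shifts and a list of `datetime` values; the timestamps are synthetic and for illustration only, not real measurements.

```python
from collections import Counter
from datetime import datetime


def pages_per_shift(page_times, days, shift_hours=12):
    """Mean pages per shift over a window of `days` days, plus the busiest shift."""
    shifts_per_day = 24 // shift_hours
    # Count pages per (date, shift index) bucket.
    per_shift = Counter((ts.date(), ts.hour // shift_hours) for ts in page_times)
    mean = len(page_times) / (days * shifts_per_day)
    busiest = max(per_shift.values(), default=0)
    return mean, busiest


# Synthetic timestamps for illustration only.
before = [datetime(2024, 1, 1, 3, m) for m in range(0, 60, 10)] + [datetime(2024, 1, 2, 15, 0)]
after = [datetime(2024, 1, 8, 3, 0), datetime(2024, 1, 9, 15, 0)]

print(pages_per_shift(before, days=7))  # (0.5, 6)
print(pages_per_shift(after, days=7))
```

Tracking the busiest shift alongside the mean helps, because one night with six pages can hide inside a low weekly average.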