The Acceptable-Flapping Policy for Noisy Alerts
Some alerts flap. Most are noise; some are signal. The policy that turns flapping into structured response.
Track flaps
The foundation is a concrete definition of what counts as a flap. Without one, "noisy alert" becomes a vibes-based judgement and nothing gets cleaned up. Define the threshold, auto-tag, surface the top offenders weekly.
- Concrete flap definition. "More than 3 fires in 24 hours without a sustained incident" is the working rule. Threshold is arbitrary but explicit.
- Auto-tagged flaps. Each flap auto-classified at fire time. Dashboard surfaces the top 10 candidates for cleanup.
- Every flap is a candidate. Each flapping alert gets a tune/aggregate/remove disposition. Drives explicit cleanup rather than ambient ignoring.
- Per-team flap dashboard. Each team sees its own top-flappers list. Accountability lands on the team that owns the alert.
The decision tree
Every flapping alert resolves to one of three paths. Tune the threshold if the signal is real but too tight; aggregate fires if the signal is real but transient; remove if there is no real signal underneath.
- Real signal, threshold too tight: tune. Adjust the firing threshold so the alert only fires when the underlying issue is meaningful. Most common disposition.
- Real signal, transient: aggregate. Multi-fire-into-one rule. Five flaps within 5 minutes becomes one alert with a count.
- Not a real signal: remove. The alert was wrong. Delete it; do not just suppress it.
- Documented rationale per decision. Each disposition gets a "why this" note. Future reviewers do not have to re-derive the reasoning.
Weekly review
The weekly review is what keeps the discipline alive. Without a fixed cadence, flap cleanup becomes the thing that happens after the next big incident, which is too late. Top 10, explicit decision per flap, logged so the same alert does not require re-deciding next month.
- Top 10 reviewed weekly. Bounded review scope. Each alert gets one of tune, aggregate, remove.
- Decision logged. Documented record of the disposition. Same alert flapping next month is a re-review, not a fresh decision.
- Alert pool converges to signal. Quarterly trend shows reduction toward signal-only alerts. The on-call rotation gets quieter.
- Named reviewer per week. Rotating responsible engineer. Catches "we skipped this week" before it becomes "we skipped this quarter."