Noise vs Coverage Frontier
More alerts catch more issues but create more noise. This trade-off shapes every alerting decision.
The trade-off
Tighter alerts catch more incidents but produce more noise. Looser alerts produce less noise but miss real problems.
Every alert sits on the noise/coverage frontier. Moving the threshold trades one for the other.
There is no globally optimal point. Each service has its own ratio based on customer tolerance, team size, and traffic shape.
Measuring coverage
Track every customer-impacting incident. For each, ask: did an alert fire before the customer noticed?
Coverage = (incidents where an alert fired in time) / (all real incidents). Target 80% to 95% depending on service tier.
Below 80% coverage: under-alerted; you are missing real problems. Above 95%: likely over-alerted; the last few points of coverage tend to cost disproportionate noise.
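A minimal sketch of this computation, assuming incident records carry a flag for whether an alert fired before customers noticed (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    service: str
    alert_fired_in_time: bool  # did an alert fire before the customer noticed?

def coverage(incidents: list[Incident]) -> float:
    """Fraction of real incidents that an alert caught in time."""
    if not incidents:
        return 1.0  # no incidents this window: nothing was missed
    caught = sum(1 for i in incidents if i.alert_fired_in_time)
    return caught / len(incidents)

# Example: 9 of 10 incidents caught in time -> 0.9, inside the 80-95% band.
```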
Measuring noise
Measure pages per real incident. If you page 5 times to catch one real incident, 4 of those pages were noise: a 4:1 noise ratio.
Acceptable ratios: 1:1 for tier 1 services, 2:1 for tier 2, 3:1 for tier 3. Above that, tune.
Track per service, not globally. A single noisy service inflates the org-wide average and hides healthy rotations.
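A sketch of the per-service computation, assuming page counts and real-incident counts are available per service for the measurement window (names are illustrative):

```python
def noise_ratios(pages_by_service: dict[str, int],
                 real_by_service: dict[str, int]) -> dict[str, float]:
    """Noisy pages per real incident, computed per service, never globally."""
    ratios = {}
    for service, pages in pages_by_service.items():
        real = real_by_service.get(service, 0)
        if real == 0:
            # Every page was noise, or the service was silent all quarter.
            ratios[service] = float("inf") if pages else 0.0
        else:
            # e.g. 5 pages for 1 real incident -> 4.0, i.e. a 4:1 noise ratio
            ratios[service] = (pages - real) / real
    return ratios
```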
Moving the frontier
Better signals move the entire frontier outward. SLO-based burn-rate alerts have lower noise at equal coverage than threshold alerts.
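To illustrate why burn-rate alerts are quieter, here is a sketch of a multi-window burn-rate check in the style of the Google SRE Workbook; the 14.4x factor for a 99.9% SLO is the standard paging example, and the error-ratio inputs are assumed to come from your metrics store:

```python
def burn_rate(error_ratio: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is burning relative to a uniform burn."""
    return error_ratio / (1.0 - slo_target)

def should_page(err_1h: float, err_5m: float, threshold: float = 14.4) -> bool:
    # The long window filters transient blips (less noise); the short window
    # confirms the burn is ongoing, so already-recovered incidents stop paging.
    return burn_rate(err_1h) >= threshold and burn_rate(err_5m) >= threshold
```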
Multi-signal compound alerts ("errors AND latency AND traffic") shift the frontier most; single-signal alerts force the sharpest trade-off.
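A sketch of a compound predicate; the thresholds are placeholders, and the point is that all three conditions must hold before anyone gets paged:

```python
def compound_alert(error_rate: float, p99_latency_ms: float, rps: float) -> bool:
    # Fire only when errors are elevated AND users are actually waiting AND
    # there is enough traffic for the numbers to mean anything.
    return error_rate > 0.02 and p99_latency_ms > 500 and rps > 10
```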
Synthetic monitoring shifts coverage upward without much noise cost. Real-user monitoring (RUM) is similar.
Apply per service
Pick a service. Compute current noise ratio and coverage from last quarter's data.
Decide where on the frontier the team wants to be. Document this; it constrains future alert design.
Tune toward the target over the next quarter. Re-measure; iterate.
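A sketch of the quarterly check against a documented target; the target values are examples, not recommendations:

```python
def tuning_report(coverage: float, noise_ratio: float,
                  target_coverage: float = 0.90,
                  target_noise: float = 2.0) -> str:
    """Compare measured metrics to the team's documented frontier target."""
    if coverage < target_coverage:
        return "Under-alerted: add or tighten alerts; accept some extra noise."
    if noise_ratio > target_noise:
        return "Over-alerted: loosen thresholds or upgrade signals (burn rate, compound)."
    return "On target: hold steady and re-measure next quarter."
```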