Noise vs Coverage Frontier

More alerts catch more issues but create more noise. Alerting is a trade between the two.

The trade-off

Every alert sits somewhere on the noise-versus-coverage frontier. Tighter (more sensitive) alerts catch more incidents but produce more noise; looser alerts produce less noise but miss real problems. Moving a threshold trades one for the other. There is no globally optimal point: each service has its own acceptable balance, set by customer tolerance, team size, and traffic shape.
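To make the trade concrete, here is a minimal sketch that sweeps one alert threshold over a handful of labeled metric samples. The `Sample` type, the sample values, and the `real_incident` label are hypothetical; in practice the label would come from incident review. A lower threshold here plays the role of a tighter alert.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    value: float         # metric reading, e.g. error rate in percent (illustrative)
    real_incident: bool  # hypothetical label assigned during incident review


def sweep(samples, thresholds):
    """For each threshold, return (threshold, incidents_caught, false_pages)."""
    results = []
    for t in thresholds:
        fired = [s for s in samples if s.value >= t]
        caught = sum(1 for s in fired if s.real_incident)
        false_pages = len(fired) - caught
        results.append((t, caught, false_pages))
    return results


# Made-up data: three real incidents and three harmless blips.
samples = [
    Sample(0.5, False), Sample(1.2, False), Sample(1.8, True),
    Sample(2.5, False), Sample(3.1, True), Sample(4.0, True),
]

for t, caught, false_pages in sweep(samples, [1.0, 2.0, 3.0]):
    print(f"threshold={t}: caught={caught}, false_pages={false_pages}")
```

With this data, lowering the threshold from 3.0 to 1.0 catches the third incident but adds two false pages, which is the trade described above.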

Measuring coverage

Coverage is the fraction of real incidents that an alert caught in time. Track every customer-impacting incident and ask whether an alert fired before the customer noticed: coverage = (incidents where an alert fired in time) / (real incidents). Target 80% to 95%, depending on service tier. Below 80% the service is under-alerted; above 95% it is likely over-alerted.
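The coverage calculation can be sketched as follows. The `Incident` record and its field names are assumptions for illustration; "in time" is taken to mean that the first alert fired strictly before the customer noticed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    customer_noticed_at: datetime
    first_alert_at: Optional[datetime]  # None if no alert fired at all


def coverage(incidents):
    """Fraction of real incidents where an alert fired before the customer noticed."""
    if not incidents:
        return None  # no incidents in the period, so coverage is undefined
    in_time = sum(
        1 for i in incidents
        if i.first_alert_at is not None and i.first_alert_at < i.customer_noticed_at
    )
    return in_time / len(incidents)


# Hypothetical quarter: four customer-impacting incidents.
noticed = datetime(2024, 1, 10, 12, 0)
incidents = [
    Incident(noticed, datetime(2024, 1, 10, 11, 50)),  # alert fired first
    Incident(noticed, datetime(2024, 1, 10, 11, 58)),  # alert fired first
    Incident(noticed, datetime(2024, 1, 10, 12, 5)),   # alert fired too late
    Incident(noticed, None),                           # no alert fired
]
print(f"coverage = {coverage(incidents):.0%}")
```

Here two of four incidents were caught in time, so coverage is 50%, well below the 80% floor.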

Measuring noise

Noise is measured as false pages per real incident. If it takes 5 pages to catch one real incident, 4 of them were noise, so the ratio is 4:1. Acceptable ratios are 1:1 for tier 1, 2:1 for tier 2, and 3:1 for tier 3; anything above that is worth tuning. Track the ratio per service rather than globally, because a single noisy service drags the org-wide average and hides healthy rotations.
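A minimal sketch of the per-service noise check, using the tier limits from the text. It assumes each real incident accounts for exactly one page, so every other page counts as noise; the service names and counts are hypothetical.

```python
# Tier limits from the text: tolerated false pages per real incident.
MAX_NOISE_BY_TIER = {1: 1.0, 2: 2.0, 3: 3.0}


def noise_ratio(total_pages: int, real_incidents: int) -> float:
    """False pages per real incident (assumes one true page per real incident)."""
    if real_incidents <= 0:
        raise ValueError("need at least one real incident to compute a ratio")
    return (total_pages - real_incidents) / real_incidents


def needs_tuning(tier: int, total_pages: int, real_incidents: int) -> bool:
    """True when the service's noise ratio exceeds the limit for its tier."""
    return noise_ratio(total_pages, real_incidents) > MAX_NOISE_BY_TIER[tier]


# Hypothetical quarterly counts per service: (tier, total pages, real incidents).
services = {
    "checkout": (1, 5, 1),
    "search": (2, 9, 3),
    "batch-reports": (3, 8, 2),
}

for name, (tier, pages, incidents) in services.items():
    ratio = noise_ratio(pages, incidents)
    flag = "tune" if needs_tuning(tier, pages, incidents) else "ok"
    print(f"{name}: tier {tier}, noise {ratio:.1f}:1 -> {flag}")
```

Reporting each service on its own line keeps one noisy service from masking the others, which is the reason for tracking per service rather than globally.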

Moving the frontier

Better signals shift the entire frontier outward. SLO-based burn-rate alerts have lower noise at equal coverage than threshold alerts. Multi-signal compound alerts (errors AND latency AND traffic) shift the frontier most, because single-signal alerts face the steepest trade-off. Synthetic monitoring and RUM (real user monitoring) raise coverage without much noise cost.
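As a rough illustration of the two alert styles named above, the sketch below computes a burn rate (observed error ratio divided by the error budget, 1 minus the SLO target) and a compound condition. The two-window check, the 14.4 factor, and every threshold in `compound_alert` are illustrative assumptions, not values from this document.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def burn_rate_alert(short_window_errors: float, long_window_errors: float,
                    slo_target: float, factor: float = 14.4) -> bool:
    """Fire only if both windows burn fast; the factor is an assumed example."""
    return (burn_rate(short_window_errors, slo_target) >= factor
            and burn_rate(long_window_errors, slo_target) >= factor)


def compound_alert(error_ratio: float, p99_latency_ms: float, requests_per_s: float,
                   *, max_error_ratio: float = 0.01, max_latency_ms: float = 500.0,
                   min_rps: float = 10.0) -> bool:
    """Errors AND latency AND traffic: all three must agree before paging.

    Thresholds are hypothetical; the traffic check here suppresses pages
    when there is too little traffic for the other signals to be meaningful.
    """
    return (error_ratio > max_error_ratio
            and p99_latency_ms > max_latency_ms
            and requests_per_s >= min_rps)


# A sustained 2% error ratio against a 99.9% SLO burns budget about 20x too fast.
print(burn_rate_alert(0.02, 0.02, slo_target=0.999))
# A short spike that the long window has not confirmed does not page.
print(burn_rate_alert(0.02, 0.005, slo_target=0.999))
# High errors at near-zero traffic do not page under the compound rule.
print(compound_alert(0.05, 800.0, requests_per_s=2.0))
```

Requiring agreement between windows or between signals is what removes noise without giving up much coverage: a real incident tends to show up in all of them, while a blip usually shows up in one.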

Apply per service

Apply the frontier one service at a time. Pick a service, compute its current noise ratio and coverage from last quarter’s data, and decide where on the frontier the team wants to be. Document that decision, because it constrains future alert design. Then tune toward the target over the next quarter, re-measure, and iterate.
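One way to document the chosen position and check progress each quarter is sketched below. `FrontierTarget`, the `review` function, and the example numbers are hypothetical; the point is only that the target is written down as data and compared against measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontierTarget:
    """The documented position a team has chosen for one service."""
    service: str
    min_coverage: float     # e.g. 0.90 means 90% of incidents caught in time
    max_noise_ratio: float  # false pages tolerated per real incident


def review(target: FrontierTarget, coverage: float, noise_ratio: float) -> list:
    """Compare last quarter's measurements with the target and suggest a direction."""
    actions = []
    if coverage < target.min_coverage:
        actions.append("add or tighten alerts: coverage below target")
    if noise_ratio > target.max_noise_ratio:
        actions.append("loosen or consolidate alerts: noise above target")
    return actions or ["on target: re-measure next quarter"]


# Hypothetical tier 1 service measured at 75% coverage and 4:1 noise.
target = FrontierTarget("checkout", min_coverage=0.90, max_noise_ratio=1.0)
for action in review(target, coverage=0.75, noise_ratio=4.0):
    print(action)
```

When both actions come back at once, moving a threshold alone cannot fix the service, since that only trades one problem for the other; better signals are the way out.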