Noise vs Coverage Frontier
More alerts catch more issues but also create more noise; the two trade off against each other.
The trade-off
Every alert sits on the noise-versus-coverage frontier. Tighter alerts catch more incidents but produce more noise; looser alerts produce less noise but miss real problems; moving the threshold trades one for the other. There is no globally optimal point because each service has its own ratio based on customer tolerance, team size, and traffic shape.
- Tighter catches more, noisier. The high-coverage end of the frontier; more pages.
- Looser is less noisy, misses more. The low-noise end of the frontier; fewer pages, more incidents slip through.
- No global optimum. Each service has its own ratio; customer tolerance, team size, traffic shape drive it.
- Per-service position documented. The chosen frontier point committed; supports later tuning.
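To make the trade-off concrete, the sketch below sweeps a paging threshold over a small set of labeled samples and reports coverage and false pages at each setting. The `frontier` function, the sample shape, and the use of a single numeric metric are illustrative assumptions, not a feature of any particular monitoring tool.

```python
from typing import List, Tuple

# Hypothetical sample shape: (peak metric value, was it a real incident?)
Sample = Tuple[float, bool]


def frontier(samples: List[Sample], thresholds: List[float]) -> List[Tuple[float, float, int]]:
    """Return (threshold, coverage, false_pages) for each candidate threshold.

    An alert is assumed to page whenever the metric value is at or above
    the threshold, so a lower threshold is a "tighter" alert.
    """
    total_real = sum(1 for _, real in samples if real)
    points = []
    for t in thresholds:
        fired = [real for value, real in samples if value >= t]
        caught = sum(1 for real in fired if real)
        false_pages = len(fired) - caught
        coverage = caught / total_real if total_real else 0.0
        points.append((t, coverage, false_pages))
    return points
```

On made-up data, lowering the threshold should raise coverage and false pages together, which is the frontier described above. Where a service sits on that curve remains a per-service decision.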
Measuring coverage
Coverage is the share of real incidents that an alert caught in time. Track every customer-impacting incident and ask whether an alert fired before the customer noticed; coverage = (alerts that fired in time) / (real incidents); target 80% to 95% depending on service tier. Below 80% the service is under-alerted; above 95% it is likely over-alerted, because reaching that level usually takes alerts sensitive enough to page on non-incidents as well.
- Customer-incident tracking. Every customer-impacting incident logged; the denominator.
- Coverage ratio. (alerts that fired in time) / (real incidents); the metric.
- 80-95% target by tier. Higher target for higher tier; the per-service target.
- Below 80% under-alerted. Above 95% likely over-alerted; both edges deserve tuning.
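The coverage ratio above can be computed from an incident log along these lines. This is a minimal sketch: the `Incident` record and its field names are hypothetical, and "in time" is taken to mean that the first alert fired strictly before the customer noticed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical record shape for a customer-impacting incident.
    service: str
    customer_noticed_at: datetime
    first_alert_at: Optional[datetime]  # None if no alert fired at all


def coverage(incidents: List[Incident]) -> Optional[float]:
    """(alerts that fired in time) / (real incidents); None if there is no data."""
    if not incidents:
        return None
    in_time = sum(
        1
        for i in incidents
        if i.first_alert_at is not None and i.first_alert_at < i.customer_noticed_at
    )
    return in_time / len(incidents)


def classify_coverage(ratio: float, low: float = 0.80, high: float = 0.95) -> str:
    """Place a coverage ratio relative to the 80-95% band from the text."""
    if ratio < low:
        return "under-alerted"
    if ratio > high:
        return "likely over-alerted"
    return "within target band"
```

The `low` and `high` defaults mirror the band stated above; a given tier would narrow them to its own target.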
Measuring noise
Noise is measured as noise pages per real incident. If 5 pages fire to catch one real incident, 4 of them are noise, a 4:1 ratio. Acceptable ratios are 1:1 for tier 1, 2:1 for tier 2, and 3:1 for tier 3; anything above that is worth tuning. Track the ratio per service rather than globally, because a single noisy service drags the org-wide average up and hides the rotations that are healthy.
- Noise pages per real incident. 5 pages for 1 real incident is 4 noise pages, or 4:1; the basic metric.
- Tier-based ratios. 1:1 tier 1, 2:1 tier 2, 3:1 tier 3; above that, tune.
- Per-service tracking. A global average hides noisy outliers; per-service tracking surfaces them.
- Per-rotation health view. Per-service ratios show which rotations are healthy; supports fairness work.
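The noise arithmetic and the tier limits can be sketched as follows. The function names and the `service_stats` shape are hypothetical, and the calculation assumes each real incident accounts for exactly one of the pages, which matches the 5-pages, 4:1 example above.

```python
from typing import Dict, List, Tuple

# Maximum acceptable noise ratio per tier, taken from the text.
TIER_MAX_NOISE = {1: 1.0, 2: 2.0, 3: 3.0}


def noise_ratio(pages: int, real_incidents: int) -> float:
    """Noise pages per real incident, assuming one page per real incident."""
    if real_incidents == 0:
        # Every page is noise; treat any paging at all as unbounded noise.
        return float("inf") if pages > 0 else 0.0
    return max(pages - real_incidents, 0) / real_incidents


def needs_tuning(service_stats: Dict[str, Tuple[int, int, int]]) -> List[str]:
    """service_stats maps name -> (tier, pages, real_incidents).

    Returns the services whose noise ratio exceeds their tier's limit.
    """
    return sorted(
        name
        for name, (tier, pages, real) in service_stats.items()
        if noise_ratio(pages, real) > TIER_MAX_NOISE[tier]
    )
```

Evaluating each service against its own tier limit is what keeps a single noisy service from being averaged away.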
Moving the frontier
Better signals shift the entire frontier outward. SLO-based burn-rate alerts have lower noise at equal coverage than threshold alerts; multi-signal compound alerts (errors AND latency AND traffic) shift the frontier most, because a single-signal alert faces the steepest trade-off between noise and coverage; synthetic monitoring and RUM shift coverage upward without much noise cost.
- SLO burn-rate alerts. Lower noise at equal coverage than threshold alerts; the structural improvement.
- Multi-signal compound. Errors AND latency AND traffic; shifts the frontier most; the strongest pattern.
- Single-signal trades off most. The single-signal alert sits on the worst part of the frontier.
- Synthetic and RUM. Shift coverage upward without much noise cost; coverage-leaning improvements.
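A rough sketch of the two strongest patterns above. Burn rate is computed as the observed error ratio divided by the error budget (1 minus the SLO target). Three things here are assumptions for illustration, not values from the text: the two-window check, the 14.4 default threshold, and every limit in `compound_page`. Reading the traffic condition as a minimum request rate is also an assumption.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def burn_rate_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; choose per service
) -> bool:
    """Page only if both windows burn fast, so a brief spike alone does not page."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


def compound_page(
    error_ratio: float,
    p99_latency_ms: float,
    requests_per_sec: float,
    *,
    max_error_ratio: float = 0.01,  # assumed limit
    max_p99_ms: float = 500.0,      # assumed limit
    min_rps: float = 10.0,          # assumed: ignore near-idle traffic
) -> bool:
    """Errors AND latency AND traffic: all three conditions must hold to page."""
    return (
        error_ratio > max_error_ratio
        and p99_latency_ms > max_p99_ms
        and requests_per_sec >= min_rps
    )
```

Requiring several conditions at once removes pages that a single signal would have fired on its own, which is where the noise reduction at equal coverage comes from.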
Apply per service
Application is per-service. Pick a service, compute its current noise ratio and coverage from last quarter’s data, decide where on the frontier the team wants to be (document the decision, because it constrains future alert design), tune toward the target over the next quarter, re-measure, and iterate.
- Pick a service. The unit of analysis; per-service application.
- Compute current ratio and coverage. Last quarter’s data; the baseline.
- Decide frontier point. Document it, because it constrains future alert design.
- Tune-measure-iterate. Tune over the next quarter, re-measure, iterate; the cadence is built in.
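One way to record the chosen frontier point and check a later measurement against it is sketched below. `FrontierPoint`, `ServicePlan`, and `tuning_actions` are hypothetical names, and the two checks are the simplest possible reading of "tune toward the target".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrontierPoint:
    coverage: float     # fraction of real incidents alerted on in time
    noise_ratio: float  # noise pages per real incident


@dataclass(frozen=True)
class ServicePlan:
    # The documented decision for one service.
    service: str
    baseline: FrontierPoint  # measured from last quarter's data
    target: FrontierPoint    # where the team wants to be
    rationale: str           # why this point; constrains future alert design


def tuning_actions(plan: ServicePlan, measured: FrontierPoint) -> List[str]:
    """Compare a re-measurement against the documented target."""
    actions = []
    if measured.coverage < plan.target.coverage:
        actions.append("raise coverage")
    if measured.noise_ratio > plan.target.noise_ratio:
        actions.append("reduce noise")
    return actions
```

Each quarter's re-measurement would be passed in as `measured`; an empty list means the service has reached its documented point.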