On-Call Noise vs Coverage
Trade-off in alert tuning.
Overview
Alert tuning is a trade-off between sensitivity and noise. Tighten thresholds and you catch more incidents earlier but live with more false positives. Loosen thresholds and the rotation calms down but coverage gaps appear. The discipline is making that trade-off explicitly per tier rather than letting it drift, and tracking false-positive rate as a first-class metric so the conversation stays grounded in data.
- Sensitivity-versus-noise trade-off. Every alert sits somewhere on this curve. Pretending otherwise is how teams end up with both noise and coverage gaps.
- Per-tier SLO thresholds. Tier-1 service alerts fire on tighter thresholds than tier-3. Priority shapes the curve.
- Per-alert false-positive (FP) rate. Tracked as a metric for every alert. Alerts whose FP rate exceeds an agreed threshold get reviewed or retired.
- Quarterly alert review plus coverage-gap awareness. Standing review catches drift; missed-detection postmortems surface coverage gaps.
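As a rough illustration of tracking the per-alert false-positive rate, here is a minimal Python sketch. It assumes each page is logged with the alert name and a responder judgment on whether action was needed. The `Page` record, the `actionable` field, the alert names, and the 0.5 review threshold are all hypothetical and not taken from any particular alerting tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One page sent to on-call (hypothetical record shape)."""
    alert: str        # name of the alert that fired
    actionable: bool  # True if the responder actually had to act


def fp_rates(pages):
    """Return {alert name: fraction of its pages that were not actionable}."""
    fired = Counter(p.alert for p in pages)
    false_positives = Counter(p.alert for p in pages if not p.actionable)
    return {name: false_positives[name] / count for name, count in fired.items()}


def needs_review(pages, fp_threshold=0.5):
    """List alerts whose FP rate exceeds the (assumed) review threshold."""
    rates = fp_rates(pages)
    return sorted(name for name, rate in rates.items() if rate > fp_threshold)


# Illustrative data only: two actionable disk pages, three CPU pages of which one was real.
pages = [
    Page("disk_full", True),
    Page("disk_full", True),
    Page("cpu_spike", False),
    Page("cpu_spike", False),
    Page("cpu_spike", True),
]

print(needs_review(pages))  # ['cpu_spike']
```

In this sketch an alert is only flagged for review, not retired automatically; whether to retune or retire it stays a human decision.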
The approach
Three habits keep alert quality high: per-tier thresholds tied to SLO priority, per-alert FP-rate tracking, and a quarterly review that prunes alerts that no longer earn their pages.
- Per-tier thresholds. Tier-1 services have the tightest thresholds; lower tiers get looser ones. Priority drives policy.
- Per-alert FP-rate tracking. The metric that exposes alerts that fire too often without action.
- Quarterly alert review. Standing meeting that prunes noisy alerts and tightens loose ones based on the data.
- Coverage-gap awareness plus documented policy. Incidents that customers report before any alert fires are flagged as coverage gaps; each team's alert policy lives in its runbook.
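The habits above could be combined into a small review helper along the following lines. This is a sketch under stated assumptions, not a prescribed policy: the `TIER_POLICY` table, its numbers, the `AlertStats` fields, and the rule that a coverage gap outranks noise are all hypothetical. Tier 1 is given the tightest firing threshold and, following the trade-off described earlier, the highest tolerated FP rate.

```python
from dataclasses import dataclass

# Hypothetical per-tier policy. The numbers are placeholders, not recommendations.
# Tier 1 fires at the lowest error rate and tolerates the most false positives.
TIER_POLICY = {
    1: {"error_rate_threshold": 0.01, "max_fp_rate": 0.50},
    2: {"error_rate_threshold": 0.02, "max_fp_rate": 0.35},
    3: {"error_rate_threshold": 0.05, "max_fp_rate": 0.20},
}


@dataclass(frozen=True)
class AlertStats:
    """Per-alert numbers gathered for one review period (hypothetical shape)."""
    name: str
    tier: int
    fp_rate: float                 # fraction of pages that needed no action
    customer_first_incidents: int  # incidents customers reported before the alert fired


def should_fire(tier, error_rate):
    """Apply the tier's threshold: tighter tiers fire at lower error rates."""
    return error_rate >= TIER_POLICY[tier]["error_rate_threshold"]


def review(stats):
    """Suggest a quarterly-review outcome: 'tighten', 'prune', or 'keep'."""
    policy = TIER_POLICY[stats.tier]
    if stats.customer_first_incidents > 0:
        return "tighten"  # assumed priority: a coverage gap outranks noise
    if stats.fp_rate > policy["max_fp_rate"]:
        return "prune"    # too noisy for its tier
    return "keep"


print(should_fire(1, 0.02), should_fire(3, 0.02))  # True False
print(review(AlertStats("batch_lag", 3, 0.35, 0)))  # prune
```

Keeping the policy in one table makes the per-tier trade-off explicit and easy to copy into a runbook; the suggested outcome is an input to the review meeting, not a replacement for it.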
Why this compounds
Each tuned alert adds a little to on-call quality. Retention improves, mean time to detect improves on the alerts that matter, and the rotation stops being the place engineers go to burn out.
- On-call experience improves. The right balance between sensitivity and noise preserves rotation health, and retention follows.
- Incident response improves. Alerts that fire mean something. Response time drops on real signals.
- Operational fit. Alert policy is matched to service priority, so the org operates the way the SLO says it should.
- Year-one investment, year-two habit. The first review is a heavy lift. By the fourth quarterly review, alert tuning is muscle memory.