Noise vs Coverage: The On-Call Trade-off
Tightening alerts reduces noise but risks missing real incidents. This article lays out a framework for finding the right balance.
The cost of noise
Noise versus coverage is the perennial trade-off in on-call alerting. When alerting is too noisy, the on-call engineer loses sleep and burns out, and real alerts get lost among the false alarms. When it is too quiet, customer issues go undetected and users find them before alerts do. A mature team measures both and adjusts to keep them in balance.
What noise costs:
- Sleep loss: Pages at 3 AM cost the on-call engineer sleep. Repeated false alarms build cumulative sleep debt, and the engineer becomes less effective for the rest of the rotation. The personal cost is real.
- Alert fatigue: When most pages are false alarms, the on-call engineer learns to dismiss alerts without investigating. Dismissal becomes habitual, and real alerts get the same treatment as noise.
- Ignored real alerts: The most damaging consequence of alert noise is that real alerts get ignored. The team paid for the alerting infrastructure, the alerts fired, the on-call engineer dismissed them, and the customer-impacting incident went undetected for hours.
- The expensive failure mode: Customer-impacting incidents that should have been caught but were not are the highest-cost failure of alerting. That cost dwarfs both the cost of sleep loss and the cost of the alerting infrastructure.
- Aim for a real-page rate above 70%: The metric to track is the percentage of pages that are real rather than false alarms. Above 70% is healthy; below 50% indicates the alerting needs significant tuning.
Noise has both personal and operational costs. The team measures it; the team manages it.
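As a rough illustration, the real-page rate can be computed from a triage log. This is a minimal sketch, not a prescribed implementation: the `Page` record, its `was_real` flag, and the "borderline" label for the 50-70% band are assumptions made for the example; only the 70% and 50% cutoffs come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Page:
    alert_name: str
    was_real: bool  # hypothetical label, set by the on-call engineer after triage


def real_page_rate(pages: Sequence[Page]) -> Optional[float]:
    """Fraction of pages that were real; None when there were no pages."""
    if not pages:
        return None
    return sum(1 for p in pages if p.was_real) / len(pages)


def classify_noise(rate: Optional[float]) -> str:
    """Apply the 70% / 50% cutoffs described in the text."""
    if rate is None:
        return "no data"
    if rate > 0.70:
        return "healthy"
    if rate < 0.50:
        return "needs significant tuning"
    # Assumption: the text does not name the band between 50% and 70%.
    return "borderline"


# Example triage log: two real pages out of five.
pages = [
    Page("disk_full", True),
    Page("cpu_spike", False),
    Page("api_5xx", True),
    Page("cpu_spike", False),
    Page("cpu_spike", False),
]
rate = real_page_rate(pages)
print(rate, classify_noise(rate))
```

Grouping the same log by `alert_name` would show which individual alerts contribute most of the false alarms, which is usually where tuning starts.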
The cost of missed coverage
The opposite failure mode is just as real and just as expensive. If alerts do not fire when issues happen, customers find the issues first. The external-detected-incident rate (the share of incidents found by users or external monitoring rather than internal alerts) measures this; the team aims to keep it low.
- Customer-impacting incidents detected by users instead of alerts: A customer reports an issue, and investigation reveals it had been ongoing for hours without any alert firing. Monitoring missed it, the customer experience degraded, and the incident timeline starts well before the team became aware.
- Aim for external-detected incidents below 5% of total incidents: The metric is the percentage of incidents detected by users (or external monitoring) rather than by internal alerts. Below 5% is healthy; above 15% indicates significant coverage gaps.
- Reputation cost: Customers who have to report issues feel that the team should already have known. Repeated user-detected incidents erode customer trust; the cost is reputational as well as operational.
- Time-to-detect penalty: User-detected incidents take longer to detect than alert-detected ones. The customer's report takes time to reach the team, and that delay adds directly to time-to-resolution. Faster detection produces faster resolution.
- Postmortem signal: Each user-detected incident gets a postmortem that asks why no alert fired and what alert would have caught it. The remediation is tracked, and the team's coverage improves over time.
Coverage gaps are quieter than noise but no less important. The metric tracks them; the postmortems remediate them.
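The coverage metric can be sketched the same way. The example below assumes each incident record carries a detection-source label; the label values (`internal_alert`, `user_report`, `external_monitor`) and the "borderline" band between 5% and 15% are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence

INTERNAL_ALERT = "internal_alert"  # hypothetical detection-source label


def external_detected_rate(detected_by: Sequence[str]) -> Optional[float]:
    """Fraction of incidents not detected by internal alerts; None if no incidents."""
    if not detected_by:
        return None
    external = sum(1 for source in detected_by if source != INTERNAL_ALERT)
    return external / len(detected_by)


def classify_coverage(rate: Optional[float]) -> str:
    """Apply the 5% / 15% cutoffs described in the text."""
    if rate is None:
        return "no data"
    if rate < 0.05:
        return "healthy"
    if rate > 0.15:
        return "significant coverage gaps"
    # Assumption: the text does not name the band between 5% and 15%.
    return "borderline"


# Example quarter: 20 incidents, 4 of them found outside internal alerting.
incidents = (
    ["internal_alert"] * 16
    + ["user_report"] * 3
    + ["external_monitor"]
)
coverage_rate = external_detected_rate(incidents)
print(coverage_rate, classify_coverage(coverage_rate))
```

Each externally detected incident in such a log is also a candidate for the postmortem question of which alert would have caught it.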
The tune
The two metrics drive policy together. Neither can be optimized in isolation; the team measures both and adjusts based on the balance.
- Both metrics together drive policy: Real-page rate and external-detected-incident rate are both tracked. Improving one at the expense of the other is not a win; the balance is what matters.
- If noise is too high, tighten: A real-page rate below 50% indicates noise is the problem. The team tightens thresholds, adds dependencies (alert only if X and Y both happen), and removes alerts that fire often without leading to action.
- If coverage is bad, loosen or add alerts: An external-detected-incident rate above 15% indicates coverage gaps. The team loosens thresholds, adds new alerts for failure modes that were missed, and improves the alert-to-incident pipeline.
- Quarterly review: Once per quarter, the team reviews both metrics. The review produces specific tuning actions: alerts to remove, alerts to add, thresholds to adjust. The review is not optional; without it, alerts drift.
- Traffic shifts; alerts need recalibration: Traffic patterns that were normal six months ago may look different now. Alerts calibrated for old traffic produce too many or too few pages today. Recalibration is part of the discipline.
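The review logic above can be sketched as a small decision function. This is an illustrative sketch under stated assumptions: the function name, the action labels, and the idea of encoding the review as code are all hypothetical, and the cutoffs are parameters that default to the 50% and 15% "significant" levels given for each metric.

```python
from typing import List


def tuning_actions(
    real_page_rate: float,
    external_detected_rate: float,
    noise_cutoff: float = 0.50,     # below this, noise is the problem
    coverage_cutoff: float = 0.15,  # above this, coverage is the problem
) -> List[str]:
    """Return the tuning directions suggested by the two metrics together."""
    actions = []
    if real_page_rate < noise_cutoff:
        # Tighten thresholds, add dependencies, remove alerts nobody acts on.
        actions.append("tighten")
    if external_detected_rate > coverage_cutoff:
        # Loosen thresholds or add alerts for the failure modes that were missed.
        actions.append("loosen_or_add")
    if not actions:
        actions.append("hold")
    return actions


print(tuning_actions(0.40, 0.03))  # noisy, coverage fine
print(tuning_actions(0.80, 0.20))  # quiet, coverage gaps
print(tuning_actions(0.40, 0.20))  # both problems at once
```

Both directions can come back at once: a service can page too often on some alerts while missing other failure modes entirely, which is why neither metric is read in isolation.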
Balancing on-call noise against coverage is an operational discipline whose benefits compound over the team's lifetime. Nova AI Ops integrates with paging and incident data, surfaces both metrics, and produces the per-service alert tuning queue that drives the quarterly review.