SLOs and On-Call Pages
On-call should map to SLO breach.
Rule
Most teams page their on-call on every error, every threshold breach, every alert that fires anywhere in their monitoring system. The result is alert fatigue: the on-call wakes up at 3am for a 30-second blip that did not affect any customer, learns to ignore the next page, and misses the real incident three weeks later. SLO-aware on-call replaces this pattern with a simple rule: pages fire when the SLO is at risk, not when any individual error happens.
What this rule actually means:
- Pages tied to SLO breach risk: The on-call rotation is paged when the burn rate (how fast the error budget is being consumed, relative to the pace that would exactly exhaust it by the end of the SLO window) suggests the SLO will be breached if the situation continues. Specific incidents that consume budget but do not threaten the SLO get a ticket, not a page. The distinction is mechanical: burn rate is the deciding metric.
- Not just any error: A 5xx error that affects 0.001% of requests is noise. A 5xx rate that is pushing the budget toward exhaustion is signal. The on-call rotation responds to signal; ticket queues handle noise.
- Multi-window burn-rate alerting: Google's Site Reliability Workbook recommends specific thresholds for a 30-day SLO: a 14.4x burn rate over 1 hour (2% of the budget consumed; severe, urgent), 6x over 6 hours (5% consumed; significant), and 1x over 3 days (10% consumed; sustained drift, typically a ticket rather than a page). Each long window is paired with a short window one-twelfth its length, so the alert stops firing soon after the burn stops. Each threshold has a different urgency, and each routes to different responders.
- Customer-impact aware: The pages are calibrated against what customers experience, not what the infrastructure reports. CPU at 95% does not page; customer-facing p99 latency above the SLO threshold does. The metric that matters is the one customers feel.
- Per-service tuning: Each service has its own SLO; each service has its own page thresholds. A Tier 0 service pages on smaller burn rates; a Tier 2 service pages only on substantial budget consumption. The thresholds match the service's importance.
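The page-versus-ticket decision described in this list can be sketched in a few lines of Python. This is a minimal illustration, not a production alerting rule: it assumes a 30-day SLO window, uses the window pairs and burn-rate thresholds from the Site Reliability Workbook, and the names `BurnRule`, `burn_rate`, `budget_fraction`, and `decide` are hypothetical. The error ratios would come from whatever metrics system the team already runs.

```python
from dataclasses import dataclass
from typing import Mapping

SLO_PERIOD_MIN = 30 * 24 * 60  # assumption: 30-day SLO window, in minutes


@dataclass(frozen=True)
class BurnRule:
    long_min: int     # long window, in minutes
    short_min: int    # short confirmation window (1/12 of the long one)
    threshold: float  # burn-rate multiple that triggers the rule
    action: str       # "page" or "ticket"


# Window pairs and thresholds as recommended in the Site Reliability Workbook.
RULES = (
    BurnRule(60, 5, 14.4, "page"),
    BurnRule(360, 30, 6.0, "page"),
    BurnRule(4320, 360, 1.0, "ticket"),
)


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Error ratio divided by the error budget (1 - SLO target)."""
    return error_ratio / (1.0 - slo_target)


def budget_fraction(rule: BurnRule) -> float:
    """Share of the whole budget spent if the threshold burn lasts the long window."""
    return rule.threshold * rule.long_min / SLO_PERIOD_MIN


def decide(error_ratio_by_window: Mapping[int, float], slo_target: float) -> str:
    """Return "page", "ticket", or "none" for the most urgent rule that fires.

    error_ratio_by_window maps a window length in minutes to the observed
    error ratio over that window; missing windows count as zero errors.
    """
    for rule in RULES:
        long_burn = burn_rate(error_ratio_by_window.get(rule.long_min, 0.0), slo_target)
        short_burn = burn_rate(error_ratio_by_window.get(rule.short_min, 0.0), slo_target)
        # Both windows must exceed the threshold: the long window shows the burn
        # is significant, the short window shows it is still happening.
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.action
    return "none"
```

With a 99.9% SLO, a brief spike that shows a 5% error ratio over 5 minutes but only 0.2% over the hour returns "none", while a sustained 2% error ratio across both the 5-minute and 1-hour windows returns "page".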
The rule is simple to state and harder to implement. The implementation requires real SLO instrumentation; the cultural change is bigger than the technical one.
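Per-service tuning can be expressed as a small policy table. The sketch below is illustrative only: the tier numbers, SLO targets, and page thresholds are made-up values chosen to show the shape of the idea, and `TierPolicy`, `TIER_POLICIES`, and `route` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    slo_target: float      # availability target for services in this tier
    page_burn_rate: float  # 1-hour burn rate at or above which on-call is paged


# Assumed example values: stricter tiers page at smaller burn rates.
TIER_POLICIES = {
    0: TierPolicy(slo_target=0.9995, page_burn_rate=10.0),
    1: TierPolicy(slo_target=0.999, page_burn_rate=14.4),
    2: TierPolicy(slo_target=0.99, page_burn_rate=20.0),
}


def route(tier: int, error_ratio_1h: float) -> str:
    """Route a 1-hour error ratio to "page", "ticket", or "none" for a tier."""
    policy = TIER_POLICIES[tier]
    burn = error_ratio_1h / (1.0 - policy.slo_target)
    if burn >= policy.page_burn_rate:
        return "page"
    if burn >= 1.0:
        # Budget is being spent faster than planned, but not fast enough to page.
        return "ticket"
    return "none"
```

Under these assumed values, the same 0.6% error ratio over an hour pages a Tier 0 service, opens a ticket for a Tier 1 service, and does nothing for a Tier 2 service.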
Reduces noise
The benefit of SLO-aware on-call is dramatic noise reduction. Teams that adopt this pattern routinely report that page volume drops by 60 to 80% within a quarter. The pages that remain are real; the pages that went away were noise.
- SLO-aware alerts produce fewer false pages: The vast majority of monitoring alerts in legacy setups are not customer-impacting. They are infrastructure noise (transient spikes, recovered failures, dependency variability). SLO-aware alerts ignore the noise and surface only the cases that threaten user experience.
- Burnout reduction: On-call burnout is one of the largest costs in engineering. The team that gets paged three times per week loses sleep, loses productivity, and eventually loses people. The team paged once a month for real issues retains its on-call capacity over years.
- Trust in the page system: When pages are real, the on-call responds urgently. When pages are noise, the on-call learns to delay response. The first behavior is what saves the SLO during real incidents; the second is what makes incidents worse.
- Faster mean time to recovery: Counter-intuitively, fewer pages produce faster MTTR. An on-call engineer who is rested, focused, and confident that the page is real responds faster than one who is fatigued from noise. The reasoning is simple: attention spent on noise is not available for the real incident.
- Cultural shift toward calm operation: An organization where on-call is calm operates differently from one where on-call is constant firefighting. The calm organization invests in long-term reliability; the firefighting organization runs from incident to incident. The shift compounds.
The noise reduction is the visible benefit. The cultural shift it enables is the deeper benefit.
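Whether page volume is actually dropping is easy to check from the page history. The following is a rough sketch, assuming the paging tool can export the timestamp of each page; `pages_per_week` and `volume_change` are hypothetical helper names.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def pages_per_week(fired_at: Iterable[datetime]) -> Counter:
    """Count pages per ISO (year, week) pair."""
    counts: Counter = Counter()
    for ts in fired_at:
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def volume_change(before: int, after: int) -> float:
    """Relative change in page volume; -0.75 means a 75% drop."""
    if before <= 0:
        raise ValueError("need a non-zero baseline to compare against")
    return (after - before) / before
```

Comparing a quarter's page count before and after the switch with `volume_change` gives a number the team can track, rather than relying on the impression that on-call feels quieter.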
Signal
The flip side of noise reduction is signal preservation. The pages that fire are pages that should have fired. Each page is a real signal that warrants real response. The on-call's trust in the page system is the foundation of effective incident response.
- Page equals SLO actually at risk: When a page fires, the SLO is genuinely threatened. The customer experience is at stake; the team's reliability commitment is at stake; the response time matters. The on-call treats every page as serious because every page is serious.
- Trustworthy alerting: The on-call does not have to second-guess whether the page is real. The mechanical rule (page on burn rate above threshold) means the page-firing condition is verifiable. There is no "maybe it is just a flake" reaction; the burn rate either is or is not threatening.
- Justifies escalation: When the page fires and the on-call cannot resolve it quickly, escalation is justified. Pulling in a second engineer, breaking the deploy freeze, and paging leadership are appropriate responses to a real signal. The infrastructure of escalation works because the trigger is trustworthy.
- Worth disturbing weekend or sleep: A real page that fires at 3am justifies waking the on-call. Without trust in the signal, the team rationalizes "let's wait until morning to look"; with trust, the team responds immediately. The response speed is what determines the customer impact.
- Fewer real incidents missed: A team whose pages are 80% noise learns to ignore pages, and in doing so also ignores some of the 20% that are real. SLO-aware alerting flips the ratio: when nearly every page is real, attending to every page covers nearly every real incident.
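The claim that "pages are real" can be measured as page precision: the share of pages that, on review, corresponded to an SLO genuinely at risk. The sketch below assumes each page gets a yes/no label during post-incident review; `PageReview` and `page_precision` are hypothetical names, and the labelling process itself is left to the team.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PageReview:
    page_id: str
    slo_at_risk: bool  # assumed label assigned during post-incident review


def page_precision(reviews: Sequence[PageReview]) -> Optional[float]:
    """Fraction of reviewed pages where the SLO was actually at risk.

    Returns None when there are no reviewed pages, since precision is
    undefined in that case.
    """
    if not reviews:
        return None
    real = sum(1 for r in reviews if r.slo_at_risk)
    return real / len(reviews)
```

A legacy setup where one page in five was real scores 0.2; the goal of SLO-aware alerting is to push that number toward 1.0.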
SLO-aware on-call is the pattern that turns alerting from a noise generator into a signal system. Nova AI Ops integrates SLO-aware burn-rate alerts with on-call routing, surfaces the per-page response data so the team can see whether the noise reduction is working, and tracks the on-call burden over time so the operational sustainability is measurable.