SLO-Based Alerting
Alerts driven by SLO burn rate.
Idea
SLO-based alerting is the modern alternative to threshold-based alerting. Threshold alerts fire on every short spike, so the on-call is woken at 3am for noise and alert fatigue follows. SLO alerts fire only when the SLO is actually at risk, so the on-call is woken only for things that matter. The difference in signal-to-noise ratio is dramatic.
What SLO alerting actually means:
- Alert when SLO is at risk: The alert fires on the burn rate against the SLO's error budget, not on raw error rates. A burn rate of 14x sustained for an hour consumes roughly 2% of a 30-day budget and is significant; a brief blip that does not threaten the budget is not. The alert distinguishes them.
- Not on every error: Errors happen routinely. Some are actual problems; many are noise. Threshold-based alerts fire on the noise too; SLO alerts fire only when the errors are bad enough to matter to the SLO.
- Less noise; more signal: Teams that switch from threshold-based to SLO-based alerting report a 70 to 80% reduction in page volume within a quarter. The pages that remain are the ones that actually matter; the pages that went away were the source of the alert fatigue.
- Calibrated to customer impact: The SLO is defined against customer-perceived metrics, so SLO-based alerts correlate with customer impact. When the alert fires, customers are at risk; when it does not, they generally are not. That correlation is what makes the alert trustworthy.
- Trustworthy signal: The on-call learns to trust SLO alerts, because real signal produces real action. Threshold alerts, by contrast, produce mostly noise; the on-call learns to delay response, which makes real incidents worse.
SLO alerting is the discipline that makes on-call sustainable. Without it, on-call becomes a constant low-grade firefight; with it, on-call becomes occasional real-incident response.
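The burn-rate idea reduces to a little arithmetic. As a minimal sketch, assuming burn rate is defined as the observed error ratio divided by the error budget (1 minus the SLO target) and that traffic is roughly uniform over the SLO period, the following shows how a burn rate and the share of budget it consumes might be computed. The `Slo` class and the function names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float        # e.g. 0.999 means 99.9% of requests must succeed
    period_hours: float  # length of the SLO period, e.g. 30 days = 720 hours

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the SLO period."""
        return 1.0 - self.target


def burn_rate(slo: Slo, bad: int, total: int) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    1.0 means the budget would run out exactly at the end of the period.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad / total) / slo.error_budget


def budget_consumed(slo: Slo, rate: float, window_hours: float) -> float:
    """Fraction of the period's budget used if `rate` holds for the window.

    Assumes traffic is roughly uniform across the SLO period.
    """
    return rate * window_hours / slo.period_hours


slo = Slo(target=0.999, period_hours=30 * 24)
rate = burn_rate(slo, bad=140, total=10_000)      # 1.4% errors vs. a 0.1% budget
print(round(rate, 1))                             # 14.0
print(round(budget_consumed(slo, rate, 1.0), 4))  # 0.0194
```

In this example one hour at a 14x burn uses just under 2% of a 30-day budget, which is why that level of burn is treated as page-worthy while a blip that barely moves the budget is not.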
Multi-window
The standard implementation of SLO alerting uses multi-window burn-rate alerts. Multiple windows at different sensitivities catch different incident shapes; the combination produces alerts that match real problems while avoiding both false alarms and missed incidents.
- 1-hour, 6-hour, 3-day windows: The standard windows from the Google SRE Workbook, stated for a 30-day SLO. The 1-hour window with a 14.4x burn rate catches sharp incidents. The 6-hour window with a 6x burn rate catches sustained issues. The 3-day window with a 1x burn rate catches slow drift. These thresholds correspond to consuming 2%, 5%, and 10% of the error budget, respectively.
- All firing equals real problem: When all three windows are firing simultaneously, the team has confirmation from three different signal levels. The incident is real; the burn rate is sustained; the response is justified. False alarms rarely trip all three.
- Different windows route to different responses: The 1-hour window pages on-call. The 6-hour window opens a ticket and notifies the team. The 3-day window goes into the weekly review. Each timescale gets a different response.
- Standard pattern across the industry: Modern observability platforms (Datadog, Grafana, New Relic, Honeycomb) ship multi-window burn-rate alerting as a built-in pattern. The configuration is straightforward; the value is high.
- Tunable per service: Different services may need different thresholds. A Tier 0 service may use tighter windows and lower thresholds. A Tier 2 service may use wider windows and higher thresholds. The defaults are starting points; per-service tuning calibrates them.
Multi-window alerting is the operational mechanism that makes SLO alerts work in production. The pattern is well-understood; the implementations are mature.
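The window-and-routing scheme described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not any platform's actual API: the `BurnRule` type, the `evaluate` function, and the routing labels are hypothetical, and the thresholds are the Google SRE Workbook values for a 30-day SLO (14.4x, 6x, 1x), which should be replaced with whatever a given service is tuned to.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BurnRule:
    name: str
    window_hours: float
    threshold: float  # burn-rate multiple that trips the rule
    action: str       # hypothetical routing label


# Assumed defaults for a 30-day SLO; tune per service tier.
DEFAULT_RULES: List[BurnRule] = [
    BurnRule("fast", 1.0, 14.4, "page"),
    BurnRule("medium", 6.0, 6.0, "ticket"),
    BurnRule("slow", 72.0, 1.0, "weekly_review"),
]


def evaluate(error_ratios: Dict[float, float], slo_target: float,
             rules: List[BurnRule] = DEFAULT_RULES) -> List[BurnRule]:
    """Return the rules whose window burn rate meets or exceeds the threshold.

    `error_ratios` maps a window length in hours to the observed error
    ratio (bad requests / total requests) over that window.
    """
    budget = 1.0 - slo_target
    firing = []
    for rule in rules:
        rate = error_ratios[rule.window_hours] / budget
        if rate >= rule.threshold:
            firing.append(rule)
    return firing


# A sharp incident: 2% errors in the last hour, little damage over longer windows.
firing = evaluate({1.0: 0.02, 6.0: 0.004, 72.0: 0.0005}, slo_target=0.999)
print([r.action for r in firing])  # ['page']
```

Per-service tuning then amounts to passing a different `rules` list, for example a lower fast-window threshold for a Tier 0 service. When `evaluate` returns all three rules at once, that matches the "all firing" confirmation case.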
Avoid
The patterns to avoid are the legacy threshold-based alerts responsible for most of the alert fatigue teams suffer from. Migrating away from them is the operational improvement that yields the most benefit per unit of effort.
- Avoid alerts on raw error rate: "Alert if error rate exceeds 1%" is the canonical bad alert. A brief spike to 5% for 30 seconds fires the alert, and the error rate is back to baseline before the on-call has acknowledged it. The alert was noise, and the on-call learns to ignore similar alerts.
- Avoid alerts on absolute thresholds: "Alert if latency exceeds 500ms" is similar. The threshold catches the routine variability of latency, so the on-call gets paged for normal noise. The fix is to alert on burn against a latency SLO, not on a raw threshold.
- Avoid alerts on infrastructure metrics: "Alert if CPU exceeds 80%" is infrastructure-level, not customer-facing. Many causes of high CPU do not affect customer experience, so alerting on it produces noise. Alert on customer-facing SLO impact instead.
- Tie all alerts to SLO impact: Every page-worthy alert should map to "this is currently affecting our SLO" or "this will affect our SLO if not addressed." Alerts that do not meet this bar do not warrant pages; they go to ticket queues for routine investigation.
- Migrate gradually: Teams with extensive legacy threshold alerts cannot migrate overnight. The path is incremental: implement SLO alerts alongside the existing ones, verify that the SLO alerts catch the real incidents, then retire the threshold alerts. The migration typically takes a quarter or two.
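The 30-second spike example can be worked through numerically. The sketch below assumes a 99.9% SLO, one hour of 30-second samples with equal traffic in each sample, and a 14.4x fast-burn threshold. Both function names are hypothetical. It compares a raw "error rate exceeds 1%" alert with a one-hour burn-rate alert on the same data.

```python
def threshold_alert(samples, limit=0.01):
    """Legacy style: fire if any single sample exceeds the raw limit."""
    return any(s > limit for s in samples)


def burn_rate_alert(samples, slo_target=0.999, threshold=14.4):
    """SLO style: fire if the window's mean error ratio burns budget fast enough.

    Assumes every sample carries the same amount of traffic, so the mean of
    the per-sample ratios equals the error ratio for the whole window.
    """
    mean_error = sum(samples) / len(samples)
    return mean_error / (1.0 - slo_target) >= threshold


# One hour of 30-second samples (120 points) at a 0.05% baseline error ratio.
baseline = [0.0005] * 120

# Same hour with a single 30-second spike to 5%.
spike = list(baseline)
spike[60] = 0.05

# An hour-long incident at a steady 2% error ratio.
sustained = [0.02] * 120

print(threshold_alert(spike), burn_rate_alert(spike))          # True False
print(threshold_alert(sustained), burn_rate_alert(sustained))  # True True
```

The spike lifts the hourly burn rate to roughly 0.9x, well under the paging threshold, while the sustained incident burns at about 20x. During a gradual migration, running both alert styles side by side against the same data in this way is one way to check that the SLO alerts still catch the real incidents before the threshold alerts are retired.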
SLO alerting is the discipline that makes on-call sustainable for the long run. Nova AI Ops generates multi-window SLO alerts per service, integrates with on-call routing, and tracks the page-volume trajectory so teams can verify the alerting discipline is producing the noise reduction it should.