Burn-Rate Alert Discipline

Burn-rate alerts catch sustained issues. This note covers the discipline that keeps them tuned.

Burn-rate alerts in one paragraph

Burn-rate alerts fire when the error rate over both a short window and a long window indicates that the error budget will be exhausted ahead of schedule. Pair each long window with a shorter one (5m with 1h, 30m with 6h, 6h with 3d): the fast pairs catch sharp regressions, the slow pairs catch gradual erosion, and the short window lets the alert clear soon after the problem stops. Standard pairs from the Google SRE workbook for a 30-day SLO: a 14.4x burn over 1h pages, a 6x burn over 6h pages, and a 1x burn over 3d opens a ticket.
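The joint short-window and long-window condition can be sketched in a few lines of Python. This is a minimal illustration, not a production evaluator: the BurnWindow type, the evaluate function, and the dictionary of per-window error ratios are hypothetical names chosen for the example. The window pairs and factors follow the SRE workbook's table, which pairs the 1x ticket tier with a 3-day long window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    short: str     # short window label, e.g. "5m"
    long: str      # long window label, e.g. "1h"
    factor: float  # burn-rate threshold (multiples of the budget rate)
    action: str    # "page" or "ticket"


# Window pairs as listed in the SRE workbook for a 30-day SLO.
WINDOWS = [
    BurnWindow("5m", "1h", 14.4, "page"),
    BurnWindow("30m", "6h", 6.0, "page"),
    BurnWindow("6h", "3d", 1.0, "ticket"),
]


def evaluate(error_ratios: dict[str, float], slo_target: float) -> list[str]:
    """Return the actions whose short AND long windows both exceed the threshold.

    error_ratios maps a window label to the observed error ratio (0..1)
    over that window; slo_target is a fraction such as 0.999.
    """
    budget = 1.0 - slo_target
    actions = []
    for w in WINDOWS:
        threshold = w.factor * budget
        if error_ratios[w.short] > threshold and error_ratios[w.long] > threshold:
            actions.append(w.action)
    return actions
```

For a 99.9% SLO, a 2% error ratio over 5m combined with 1.6% over 1h exceeds the 14.4x threshold (1.44%) in both windows and pages. A brief spike that lifts only the 5m ratio does not fire, because the 1h ratio stays below the threshold.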

Why burn rate beats raw thresholds

Raw thresholds ignore the SLO and produce mismatched noise. A 1% error rate spends the budget of a 99% SLO exactly on schedule (a 1x burn) but spends the budget of a 99.99% SLO 100 times too fast. Burn rate normalises the signal by asking how fast the monthly budget is being consumed, which makes thresholds comparable across services. It also reduces noise, by roughly 30-50% in most catalogs, because small spikes that don’t threaten the budget are ignored.
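The normalisation is a single division: observed error ratio over the error budget implied by the SLO. The sketch below works through the 1% example for both targets. The function names and the 30-day window default are assumptions made for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / error budget (1 - SLO target).

    Both arguments are fractions, e.g. error_ratio=0.01, slo_target=0.999.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# The same 1% error ratio against two different targets:
relaxed = burn_rate(0.01, 0.99)    # about 1x: budget lasts the whole window
strict = burn_rate(0.01, 0.9999)   # about 100x: budget gone in under a day
```

At the 14.4x paging threshold, days_to_exhaustion gives roughly 2.1 days for a 30-day window, which is why that tier pages rather than opening a ticket.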

Config patterns that work

Define the SLO once and derive the burn-rate rules from it. Sloth and Pyrra both generate Prometheus rules from a single SLO definition. Use multi-window, multi-burn-rate rules, because single-window rules either over-alert or miss slow drift. Document the SLO target and tiers inline with the rules (for example: SLO=99.9 monthly, fast burn=14.4x for 5m+1h, slow burn=6x for 30m+6h).
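To make "define once, derive the rules" concrete, here is a rough sketch that turns one SLO definition into multi-window alert expressions. It does not reproduce the input format or output of Sloth or Pyrra. The SLO class, the burn_rate_rules function, and the recording-rule names of the form slo:error_ratio:rate5m are hypothetical, and the sketch assumes such recording rules already exist with an slo label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target_percent: float  # e.g. 99.9

    @property
    def budget(self) -> float:
        """Error budget as a fraction, e.g. 0.001 for a 99.9% target."""
        return (100.0 - self.target_percent) / 100.0


# (severity, burn factor, short window, long window), as documented inline
# in the text: fast burn=14.4x for 5m+1h, slow burn=6x for 30m+6h.
BURN_TIERS = [
    ("page", 14.4, "5m", "1h"),
    ("page", 6.0, "30m", "6h"),
]


def burn_rate_rules(slo: SLO, metric_prefix: str = "slo:error_ratio:rate") -> list[dict]:
    """Derive one alert expression per tier from a single SLO definition."""
    selector = f'{{slo="{slo.name}"}}'
    rules = []
    for severity, factor, short, long_ in BURN_TIERS:
        threshold = f"{factor * slo.budget:.6g}"
        # Both windows must exceed the same threshold for the alert to fire.
        expr = (
            f"{metric_prefix}{short}{selector} > {threshold}"
            f" and {metric_prefix}{long_}{selector} > {threshold}"
        )
        rules.append({"severity": severity, "burn_factor": factor, "expr": expr})
    return rules
```

Calling burn_rate_rules(SLO("checkout", 99.9)) yields two expressions with thresholds 0.0144 and 0.006. Changing the target in one place updates every derived rule, which is the property the generators provide.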

How to roll out

Roll out burn-rate alerts in stages. Pick 3-5 user-facing SLOs first (don’t migrate every metric at once); run burn-rate alerts in parallel with old threshold alerts for 30 days and compare fire counts and resolved-ticket counts; cut over once the burn-rate version produces actionable pages and the old rules are demonstrably noisier.
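The parallel-run comparison can be reduced to two numbers per rule set: how often it fired and what fraction of fires needed human action. The sketch below is one possible way to tally them. The Fire record, the rule-set labels, and the cut-over criterion (burn-rate rules fire no more often and have a higher actionable fraction) are assumptions for illustration, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fire:
    rule_set: str     # "threshold" (old rules) or "burn_rate" (new rules)
    actionable: bool  # True if a responder had to take action


def summarize(fires: list[Fire], rule_set: str) -> tuple[int, float]:
    """Return (fire count, actionable fraction) for one rule set."""
    mine = [f for f in fires if f.rule_set == rule_set]
    count = len(mine)
    actionable = sum(f.actionable for f in mine)
    return count, (actionable / count if count else 0.0)


def ready_to_cut_over(fires: list[Fire]) -> bool:
    """Assumed criterion: the new rules fired, were more actionable, and no noisier."""
    old_count, old_ratio = summarize(fires, "threshold")
    new_count, new_ratio = summarize(fires, "burn_rate")
    return new_count > 0 and new_ratio > old_ratio and new_count <= old_count
```

Over a 30-day parallel run, ten threshold fires with two actionable against four burn-rate fires with three actionable would pass this check. Teams may prefer a stricter bar, such as requiring that no real incident was missed by the burn-rate rules.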

Adopt for the paging tier

Burn-rate alerting belongs in the paging tier, on SLO-style signals. Skip it for non-SLO signals, because burn rate only makes sense for ratio-style metrics with a target. Don’t apply it to capacity or saturation alerts, because those have different shapes. Avoid burn-rate alerts during the first month of a new service, because the SLO isn’t stable yet and the thresholds derived from it will drift.