Burn-Rate Alert Discipline
Burn-rate alerts catch sustained issues; this is the discipline that keeps them tuned.
Burn-rate alerts in one paragraph
Burn-rate alerts page when the error rates over a short window and a long window jointly indicate that the error budget will be exhausted ahead of schedule. Pair fast (5m, 1h) and slow (6h, 3d) windows because fast catches sharp regressions and slow catches gradual erosion. Standard pairs from the Google SRE workbook: 14.4x burn over 1h triggers paging; 1x burn over 3d triggers a ticket.
- Two-window joint condition. Short and long windows must agree: the long window shows the burn is significant, the short window shows it is still happening; the joint test bounds noise.
- Fast catches sharp. 5m, 1h windows; sharp regressions surface here.
- Slow catches drift. 6h, 3d windows; gradual erosion surfaces here.
- Google SRE pairs. 14.4x burn over 1h triggers paging; 1x burn over 3d triggers a ticket.
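The joint condition can be sketched in a few lines of Python. This is a minimal illustration, not a production evaluator: the function names are hypothetical, and the error ratios are assumed to have been measured already over the short and long windows.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_fire(short_error_ratio: float, long_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Fire only when BOTH windows burn faster than the threshold."""
    return (burn_rate(short_error_ratio, slo_target) > threshold
            and burn_rate(long_error_ratio, slo_target) > threshold)


# 99.9% SLO, 14.4x threshold: 2% errors over both 5m and 1h is a 20x burn.
print(should_fire(0.02, 0.02, slo_target=0.999, threshold=14.4))
# The 1h window is still elevated but the last 5m has recovered: no page.
print(should_fire(0.0005, 0.02, slo_target=0.999, threshold=14.4))
```

The second call shows why the short window matters: it lets the alert stop firing soon after the problem is over, instead of waiting for the long window to drain.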
Why burn rate beats raw thresholds
Raw thresholds ignore the SLO, so the noise they produce is mismatched to the actual risk. A 1% error rate is fine for a 99% SLO and disastrous for a 99.99% SLO. Burn rate normalises the signal by asking how fast the monthly budget is being consumed, which makes thresholds comparable across services. Because small spikes that don’t threaten the budget are ignored, this reduces noise by roughly 30-50% in most catalogs.
- Raw thresholds ignore SLO. 1% fine for 99% SLO, disastrous for 99.99% SLO; mismatched.
- Burn rate normalised. How fast are you burning the monthly budget; comparable across services.
- 30-50% noise reduction. Small spikes that don’t threaten budget are ignored.
- Per-SLO threshold derivation. Threshold computed from the SLO; the rule is principled.
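The arithmetic behind the normalisation and the threshold derivation is short enough to sketch. The helper names below are hypothetical, a 30-day budget period is assumed for "monthly", and the budget fractions (2% in 1h, 5% in 6h, 10% in 3d) are the ones the Google SRE workbook recommends rather than values stated in this article.

```python
import math

SLO_PERIOD_HOURS = 30 * 24  # assumption: "monthly" means a 30-day window


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes budget_fraction of the budget in window_hours."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def error_ratio_threshold(burn_threshold: float, slo_target: float) -> float:
    """Translate a burn-rate threshold back into a raw error ratio."""
    return burn_threshold * (1.0 - slo_target)


# The same 1% error rate is about 1x burn at 99% and about 100x at 99.99%.
print(round(burn_rate(0.01, 0.99), 2), round(burn_rate(0.01, 0.9999), 2))

# Thresholds derived from "fraction of budget consumed in the window".
print(round(burn_rate_threshold(0.02, 1), 2))    # 1h window
print(round(burn_rate_threshold(0.05, 6), 2))    # 6h window
print(round(burn_rate_threshold(0.10, 72), 2))   # 3d window

# For a 99.9% SLO, a 14.4x burn corresponds to a raw error ratio near 1.44%.
print(round(error_ratio_threshold(14.4, 0.999), 4))
```

Deriving the threshold from the budget fraction is what makes the rule principled: changing the SLO target changes the raw error ratio that pages, without anyone hand-tuning a number.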
Config patterns that work
Define the SLO once and derive the burn-rate rules from it. Sloth and Pyrra both generate Prometheus rules from a single SLO definition. Use multi-window multi-burn-rate rules, because single-window rules either over-alert or miss slow drift. Document the SLO target inline (SLO=99.9 monthly, fast burn=14.4x for 5m+1h, slow burn=6x for 30m+6h).
- Single SLO definition. Sloth and Pyrra generate Prometheus rules from one SLO definition.
- Multi-window multi-burn-rate. Single-window either over-alerts or misses slow drift.
- Inline SLO documentation. SLO=99.9 monthly, burn rate values; the context is in the rule.
- Per-SLO version control. The SLO definition lives in git; supports the audit trail.
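To make the "define once, derive the rules" idea concrete, here is a rough Python sketch that expands one SLO definition into multi-window burn-rate expressions. It does not reproduce the input or output format of Sloth or Pyrra; the `SLO` class, the `$window` placeholder, the `checkout` service, and the `http_requests_total` metric with its labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target: float      # e.g. 0.999 for "SLO=99.9 monthly"
    error_query: str   # PromQL with a $window placeholder (hypothetical)
    total_query: str


# (label, burn-rate factor, short window, long window), as documented inline.
BURN_WINDOWS = [
    ("fast", 14.4, "5m", "1h"),
    ("slow", 6.0, "30m", "6h"),
]


def error_ratio(slo: SLO, window: str) -> str:
    """PromQL expression for the error ratio over one window."""
    errors = slo.error_query.replace("$window", window)
    total = slo.total_query.replace("$window", window)
    return f"(sum({errors}) / sum({total}))"


def burn_rate_rules(slo: SLO) -> list:
    """Expand one SLO into one two-window alert rule per burn-rate pair."""
    budget = round(1.0 - slo.target, 10)
    rules = []
    for label, factor, short, long_ in BURN_WINDOWS:
        limit = f"({factor:g} * {budget})"
        expr = (f"{error_ratio(slo, short)} > {limit}"
                f" and {error_ratio(slo, long_)} > {limit}")
        summary = (f"SLO={slo.target * 100:g} monthly, "
                   f"{label} burn={factor:g}x for {short}+{long_}")
        rules.append({"alert": f"{slo.name}_{label}_burn",
                      "expr": expr,
                      "annotations": {"summary": summary}})
    return rules


# Hypothetical service and metric names, for illustration only.
checkout = SLO(
    name="checkout_availability",
    target=0.999,
    error_query='rate(http_requests_total{job="checkout",code=~"5.."}[$window])',
    total_query='rate(http_requests_total{job="checkout"}[$window])',
)

for rule in burn_rate_rules(checkout):
    print(rule["alert"], "->", rule["annotations"]["summary"])
```

Because every threshold is computed from `target`, changing the SLO in one place regenerates every rule, and the annotation carries the inline documentation along with it.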
How to roll out
Roll out burn-rate alerts in stages. Pick 3-5 user-facing SLOs first (don’t migrate every metric at once); run burn-rate alerts in parallel with old threshold alerts for 30 days and compare fire counts and resolved-ticket counts; cut over once the burn-rate version produces actionable pages and the old rules are demonstrably noisier.
- 3-5 user-facing SLOs first. Don’t try to migrate every metric at once.
- Parallel run for 30 days. Compare fire counts and resolved-ticket counts.
- Cut over on demonstrated improvement. Burn-rate produces actionable pages; old rules demonstrably noisier.
- Per-SLO migration documented. Each migration captured with before/after metrics; supports continued investment.
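The parallel-run comparison can be reduced to a small check. The sketch below is one possible way to frame the cut-over decision, not a prescribed method: `AlertStats`, `ready_to_cut_over`, and the 0.8 minimum actionable ratio are hypothetical, and the counts in the usage lines are made-up inputs rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    fired: int        # alerts fired during the 30-day parallel run
    actionable: int   # fires that led to a resolved ticket

    @property
    def actionable_ratio(self) -> float:
        """Share of fires that were worth acting on (0.0 if nothing fired)."""
        return self.actionable / self.fired if self.fired else 0.0


def ready_to_cut_over(old: AlertStats, new: AlertStats,
                      min_ratio: float = 0.8) -> bool:
    """Cut over when the new rules are mostly actionable and beat the old ones.

    min_ratio is an assumed bar for "actionable pages"; tune it per team.
    """
    return (new.actionable_ratio >= min_ratio
            and new.actionable_ratio > old.actionable_ratio)


# Illustrative numbers only.
threshold_alerts = AlertStats(fired=40, actionable=10)
burn_rate_alerts = AlertStats(fired=12, actionable=11)
print(ready_to_cut_over(threshold_alerts, burn_rate_alerts))
```

Recording the two `AlertStats` values per SLO also gives the before/after metrics that each migration should document.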
Adopt for paging tier
Burn rate belongs in the paging tier, on SLO-style signals. Skip it for non-SLO signals, because burn rate makes sense only for ratio-style metrics with a target. Don’t apply it to capacity or saturation alerts, because those have different shapes. Avoid burn-rate alerts during the first month of a new service, because SLOs aren’t stable yet and thresholds drift.
- Paging tier with SLO. Ratio-style metrics with a target; the right surface.
- Skip non-SLO signals. Burn rate doesn’t apply; raw thresholds are fine elsewhere.
- Skip capacity and saturation. Different shapes; static thresholds work better.
- Wait one month for new service. SLOs not stable; burn-rate too noisy in the first month.