Burn-Rate Alert Discipline
Burn-rate alerts catch sustained issues. This is the discipline that keeps them tuned.
Burn-rate alerts in one paragraph
Page when the error rate over both a short window and a long window jointly indicates that the error budget will be exhausted ahead of schedule.
Pair a fast window set (5m, 1h) with a slow one (30m, 6h). Fast catches sharp regressions; slow catches gradual erosion.
Standard tiers from the Google SRE Workbook: 14.4x burn over 1h (5m short window) pages, 6x over 6h (30m short window) pages, and 1x over 3d (6h short window) files a ticket.
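A minimal sketch of the fast paging tier as a Prometheus alerting rule, assuming a 99.9% SLO; the metric name http_requests_total and the job label are placeholders, not a prescribed schema. Both windows must exceed 14.4x the allowed error rate (14.4 * 0.001 = 1.44%) before the alert fires.

groups:
  - name: slo-burn-rate
    rules:
      - alert: FastErrorBudgetBurn
        # Fire only when BOTH the 5m and 1h error ratios exceed
        # 14.4x the allowed error rate for a 99.9% SLO.
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Fast error-budget burn (14.4x) on api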
Why burn rate beats raw thresholds
Raw thresholds (error rate above 1%) ignore the SLO. A 1% rate is fine for a 99% SLO and disastrous for a 99.99% SLO.
Burn rate is normalized: it measures how fast you are consuming the monthly error budget, so it is comparable across services. A 1x burn exhausts the budget exactly at the end of the period; a 14.4x burn exhausts a 30-day budget in about 50 hours.
Cuts noise by 30-50% in most alert catalogs, because brief spikes that never threaten the budget stop paging.
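The normalization is a single division by the allowed error rate. A sketch as a Prometheus recording rule, with placeholder metric names:

groups:
  - name: slo-burn
    rules:
      # burn rate = observed error ratio / allowed error ratio (1 - SLO).
      # With a 99.9% SLO the budget is 0.1%, so a 1% error rate is a 10x burn;
      # the same 1% against a 99% SLO is only a 1x burn.
      - record: job:slo_burn_rate:1h
        expr: |
          (
            sum by (job) (rate(http_requests_total{code=~"5.."}[1h]))
              / sum by (job) (rate(http_requests_total[1h]))
          ) / 0.001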
Config patterns that work
Define the SLO once and derive the burn-rate rules from it. Sloth and Pyrra both generate Prometheus rules from a single SLO definition; see the Sloth sketch after this list.
Use multi-window multi-burn-rate rules. Single-window rules either over-alert or miss slow drift.
Document the SLO target inline, e.g. SLO=99.9 monthly, fast burn=14.4x over 5m+1h, slow burn=6x over 30m+6h.
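A sketch of the single-definition approach in Sloth's prometheus/v1 format; the service and metric names are placeholders. Sloth expands {{.window}} into the multi-window rules and tiers described above.

version: "prometheus/v1"
service: "api"
slos:
  - name: "requests-availability"
    objective: 99.9
    description: "SLO=99.9 monthly; burn-rate windows are derived, not hand-written."
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="api",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="api"}[{{.window}}]))
    alerting:
      name: ApiHighErrorBudgetBurn
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: ticket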
How to roll out
Pick 3 to 5 user-facing SLOs first. Don't try to migrate every metric at once.
Run burn-rate alerts in parallel with the old threshold alerts for 30 days. Compare fire counts and resolved-ticket counts; the queries after this list are a rough way to do that.
Cut over once the burn-rate version is producing actionable pages and the old rules are demonstrably noisier.
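A rough comparison using Prometheus's built-in ALERTS series; the alert names are placeholders. count_over_time counts firing samples, not distinct incidents, so treat it as a proxy for noisiness rather than an exact incident count.

# Firing samples over the parallel run (higher = noisier).
count_over_time(ALERTS{alertname="ApiErrorRateThreshold", alertstate="firing"}[30d])
count_over_time(ALERTS{alertname="FastErrorBudgetBurn", alertstate="firing"}[30d])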
Adopt for the paging tier, skip elsewhere
Skip for non-SLO signals. Burn rate makes sense only for ratio-style metrics with a target.
Don't apply it to capacity or saturation alerts. Those have a different shape: they trend toward a hard limit rather than consuming a budget.
Avoid burn-rate alerts during the first month of a new service. SLOs aren't stable yet; thresholds drift.