Multi-Window Burn-Rate Alerts: A Deep Dive
Multi-window burn-rate is the modern SLO alert. Mastering the math takes one afternoon and pays back forever.
The single-window failure
A burn-rate alert that watches a single short window fires on any brief blip. The math conflates "one minute of badness" with "real budget consumption"; the result is flapping pages and on-call attrition.
- One window, one bad minute. A traffic spike or a stale cache eviction trips the alert; on-call responds; nothing is wrong.
- Threshold sensitivity. Lower the threshold and every blip pages; raise it and a slow burn drains the budget before the alert fires.
- The flapping cost. Each false page costs trust; engineers stop responding promptly; real incidents get missed.
- The fix. Multi-window: require both a long and a short window to cross the threshold; this filters out blips and preserves urgency.
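The gap between "one bad minute" and "real budget consumption" can be sketched numerically. The example below is a minimal illustration, not a production evaluator: it assumes a 99.9% SLO, a 30-day budget period, uniform traffic per minute, and a made-up hour of data with a single minute at 10% errors. The function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio divided by the error budget (1 - SLO target)."""
    return error_ratio / (1.0 - slo_target)


def window_error_ratio(per_minute_ratios: list[float], window_minutes: int) -> float:
    """Mean error ratio over the most recent window (assumes uniform traffic per minute)."""
    recent = per_minute_ratios[-window_minutes:]
    return sum(recent) / len(recent)


SLO_TARGET = 0.999               # assumption: 99.9% SLO
PERIOD_MINUTES = 30 * 24 * 60    # assumption: 30-day budget period

# Hypothetical hour of traffic: 59 clean minutes, then one minute at 10% errors.
hour = [0.0] * 59 + [0.10]

short_burn = burn_rate(window_error_ratio(hour, 5), SLO_TARGET)
long_burn = burn_rate(window_error_ratio(hour, 60), SLO_TARGET)

# Share of the whole 30-day budget that the single bad minute consumed.
budget_spent = sum(hour) / (PERIOD_MINUTES * (1.0 - SLO_TARGET))

print(f"5m burn rate: {short_burn:.1f}")
print(f"1h burn rate: {long_burn:.2f}")
print(f"budget spent: {budget_spent:.2%}")
```

Under these assumptions the 5m window reports a burn rate of about 20, well above a 14.4 threshold, while the 1h window stays below 2 and the blip has spent only a fraction of a percent of the budget. A 5m-only alert would page here; a rule that also consults the 1h window would not.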
Multi-window confirmation
The multi-window pattern uses two confirmation windows of different lengths. Both must agree before the alert fires; this is the structural fix to the single-window problem.
- Long window. Confirms "real, sustained budget burn"; e.g. 1h or 6h depending on the burn-rate threshold.
- Short window. Confirms "happening right now, not historical"; e.g. 5m paired with a 1h long window, or 30m paired with 6h; stops the alert from firing on a burn that has already ended.
- Both must agree. Either alone could be transient or stale; the AND gate is the noise filter.
- The result. Pages fire only when budget is genuinely burning fast and the burn is current; on-call trust is preserved.
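The AND gate described above can be sketched in a few lines. This is an illustrative model only: `WindowPair` and `should_page` are hypothetical names, and the burn rates passed in are invented values chosen to show the three cases (blip, stale, sustained).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowPair:
    threshold: float    # burn-rate threshold shared by both windows
    long_window: str    # label only, e.g. "1h"
    short_window: str   # label only, e.g. "5m"


def should_page(pair: WindowPair, long_burn: float, short_burn: float) -> bool:
    """AND gate: page only when both windows exceed the pair's threshold."""
    return long_burn > pair.threshold and short_burn > pair.threshold


fast = WindowPair(threshold=14.4, long_window="1h", short_window="5m")

# Short window is hot, long window is not: a transient blip.
blip = should_page(fast, long_burn=1.7, short_burn=20.0)
# Long window is hot, short window has recovered: a stale signal.
stale = should_page(fast, long_burn=16.0, short_burn=0.5)
# Both windows are hot: sustained and current.
sustained = should_page(fast, long_burn=16.0, short_burn=18.0)

print(blip, stale, sustained)
```

Only the third case pages. The long window guards against blips and the short window guards against alerts that keep firing after recovery.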
Threshold + window-pair math
The burn-rate threshold and window pair determine how aggressive the alert is. The canonical paging pairs in the Google SRE Workbook are a 14.4 burn rate over 1h and a 6 burn rate over 6h, both assuming a 30-day SLO period; the math behind them is worth understanding.
- 14.4 burn rate over 1h. Consumes 2% of a 30-day budget; at that rate the whole budget is gone in about 50 hours, so the alert has to page quickly.
- 6 burn rate over 6h. Consumes 5% of a 30-day budget; at that rate the budget lasts about 5 days; this is the slower-burn signal that catches gradual degradation.
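- Why these thresholds. Each follows from the share of budget you accept losing before the alert fires: burn rate = budget fraction x period / window, so 0.02 x 720h / 1h = 14.4 and 0.05 x 720h / 6h = 6. An alert at these rates leaves time to investigate and remediate.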
- The pair, not the threshold. Each pair has a long and short window; both must trip; this is the multi-window contract.
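The arithmetic behind the two pairs can be checked directly. The sketch below assumes a 30-day (720-hour) SLO period and a constant burn rate; the helper names are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period (720 hours)


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = PERIOD_HOURS) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / period_hours


def burn_rate_for(budget_fraction: float, window_hours: float,
                  period_hours: float = PERIOD_HOURS) -> float:
    """Inverse: the burn rate that spends budget_fraction within window_hours."""
    return budget_fraction * period_hours / window_hours


def hours_to_exhaustion(burn_rate: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Time until the whole budget is gone at a constant burn rate."""
    return period_hours / burn_rate


for rate, window in [(14.4, 1), (6.0, 6)]:
    print(
        f"burn {rate:>4} over {window}h: "
        f"{budget_consumed(rate, window):.0%} of budget, "
        f"exhausted in {hours_to_exhaustion(rate):.0f}h"
    )
```

The same helpers make it easy to derive a pair for a different period or budget share, for example `burn_rate_for(0.10, 72)` for a slower, ticket-level signal over three days.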
PromQL rule template
The Prometheus rule template runs to roughly 12 lines per SLO. Each SLO needs its own instance; do not apply one rule globally, because the error-ratio threshold the rule compares against depends on the SLO target.
- Recording rules first. Pre-compute error rate over each window; cheap to query repeatedly; the alert evaluates fast.
- Alert expression. The AND of two conditions: long-window burn rate above threshold AND short-window burn rate above threshold.
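- Per-SLO instantiation. The burn-rate multiplier stays the same, but the error-ratio threshold is burn rate x (1 - SLO target): 14.4 x 0.001 = 0.0144 at 99.9%, 14.4 x 0.0005 = 0.0072 at 99.95%. Copy the template, swap the target.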
- Test before shipping. Replay historical data; verify the rule fires when expected and stays silent when not.
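One way to keep per-SLO instances consistent is to render the expressions from a small generator. The sketch below is an assumption-laden illustration, not the article's own template: the recording-rule prefix `job:slo_errors_per_request:ratio_rate`, the metric `http_requests_total`, and its `code` label are placeholder names, and the function names are hypothetical. It emits PromQL text only; wrapping that text in a rule file is left out.

```python
# Assumed naming for the pre-computed error ratios, one series per window,
# e.g. job:slo_errors_per_request:ratio_rate1h. Replace with your own names.
RECORDED_RATIO = "job:slo_errors_per_request:ratio_rate"

# (burn rate, long window, short window)
PAIRS = [(14.4, "1h", "5m"), (6.0, "6h", "30m")]


def recording_rule_expr(job: str, window: str) -> str:
    """Error ratio over one window; metric and label names are placeholders."""
    errors = f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
    return f"{errors} / {total}"


def error_ratio_threshold(burn_rate: float, slo_target: float) -> str:
    """Burn rate scaled by the error budget, formatted for a PromQL comparison."""
    return f"{burn_rate * (1.0 - slo_target):g}"


def burn_rate_alert_expr(job: str, slo_target: float) -> str:
    """Render the multi-window alert expression for one SLO."""
    clauses = []
    for rate, long_w, short_w in PAIRS:
        limit = error_ratio_threshold(rate, slo_target)
        clauses.append(
            "(\n"
            f'    {RECORDED_RATIO}{long_w}{{job="{job}"}} > {limit}\n'
            "  and\n"
            f'    {RECORDED_RATIO}{short_w}{{job="{job}"}} > {limit}\n'
            ")"
        )
    # Either pair tripping is enough to alert; within a pair both windows must trip.
    return "\nor\n".join(clauses)


print(recording_rule_expr("checkout", "1h"))
print(burn_rate_alert_expr("checkout", 0.999))
```

Calling `burn_rate_alert_expr("checkout", 0.9995)` yields the same structure with the thresholds halved, which is the only part that changes between SLO targets. Rendering from code also gives the replay test a single place to vary the target.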
Antipatterns
- Single-window forever. Flapping alerts.
- Burn rate without an SLO. The math has no anchor.
- One rule for all SLOs. The error-ratio threshold scales with the error budget, so a rule tuned for one target loses precision on another.
What to do this week
Three moves. (1) Apply the pattern to your most-impactful service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap is durable.