Multi-Window Burn-Rate Alerts: A Deep Dive

Multi-window burn-rate is the modern SLO alert. Mastering the math takes one afternoon and pays back forever.

The single-window failure

Single-window burn-rate alerts fire on any short blip. The math conflates "1 minute of badness" with "real budget consumption"; the result is flapping pages and on-call attrition.

One window, one bad minute. A traffic spike or a stale cache eviction trips the alert; on-call responds; nothing is wrong.
Threshold sensitivity. Lower the threshold and you miss real burns; raise it and the budget is gone before you alert.
The flapping cost. Each false page costs trust; engineers stop responding promptly; real incidents get missed.
The fix. Multi-window: require both a long and a short window to cross threshold; eliminates blips, preserves urgency.

Multi-window confirmation

The multi-window pattern uses two confirmation windows of different lengths. Both must agree before the alert fires; this is the structural fix to the single-window problem.

Long window. Confirms "real, sustained budget burn"; e.g. 1h or 6h depending on the burn-rate threshold.
Short window. Confirms "happening right now, not historical"; e.g. 5m or 30m; prevents stale alerts.
Both must agree. Either alone could be transient or stale; the AND gate is the noise filter.
The result. Pages fire only when budget is genuinely burning fast and the burn is current; on-call trust is preserved.

Threshold + window-pair math

The burn-rate threshold and window pair determine how aggressive the alert is. The SRE workbook canonical pairs are 14.4/1h and 6/6h; the math behind them is worth understanding.

14.4 burn rate over 1h. Consumes 2% of a monthly budget; alert quickly enough to act before budget exhausts.
6 burn rate over 6h. Consumes 5% of a monthly budget; the slower-burn signal that catches gradual degradation.
Why these thresholds. Match SLO review cadence; an alert at these rates leaves time to investigate and remediate.
The pair, not the threshold. Each pair has a long and short window; both must trip; this is the multi-window contract.

PromQL rule template

The Prometheus rule template ships in roughly 12 lines per SLO. Each SLO needs its own pair; do not generalise the rule globally because thresholds depend on the SLO target.

Recording rules first. Pre-compute error rate over each window; cheap to query repeatedly; the alert evaluates fast.
Alert expression. The AND of two conditions: long-window burn rate above threshold AND short-window burn rate above threshold.
Per-SLO instantiation. Threshold depends on SLO target (99.9% vs 99.95%); copy the template, swap the target.
Test before shipping. Replay historical data; verify the rule fires when expected and stays silent when not.

Antipatterns

Single-window forever. Flapping alerts.
Burn rate without an SLO. The math has no anchor.
One pair for all SLOs. Precision lost.

What to do this week

Three moves. (1) Apply the pattern to your most-impactful service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap is durable.