By Samson Tanimawo, PhD · Published Apr 21, 2026

Multi-Window Burn-Rate Alerts: Why Single Thresholds Always Fail

Burn-rate alerts replace static thresholds with a dynamic view of how fast you are eating your error budget. The multi-window variant pairs a fast window with a slow one, so it catches both sudden spikes and slow leaks, and the page that reaches you is short, sharp, and right.

Why thresholds fail

Pick any error-rate threshold and you can write a failure mode. Two percent for fifteen minutes wakes you for a single bad deploy you would rather see in the morning. Five percent for an hour misses a slow leak that drains your monthly budget by Friday. Static thresholds answer the wrong question. The question is not "is the rate above X?" The question is "are we going to run out of budget before the period ends?"

The deeper issue: static thresholds treat all error states identically, regardless of whether the budget can absorb them. A 3% error rate on a service with a 99% availability target is fine for a few minutes; the same rate on a service with a 99.99% target is catastrophic. The threshold has no knowledge of the budget context, so it fires the same way in both cases. Engineers learn to distrust the alert, which is how chronic noise begins.

Burn-rate alerting fixes this by making the alert condition relative to the budget. Instead of "X% errors," the condition becomes "we are consuming budget at a rate that will exhaust it before the SLO window ends." That formulation aligns the alert with the actual business consequence — running out of budget — rather than a proxy that may or may not correlate.

Errors as a budget

An SLO of 99.9% over thirty days gives you 43 minutes of unavailability. That is the budget. Once you have it, every burnt minute is an entry in the same ledger as a bad deploy, a database hiccup, or a noisy neighbour. Spend it consciously.

The budget framing changes engineering culture. "Are we within budget?" becomes a meaningful question that engineers can answer; "are we reliable enough?" was always vague. Teams with explicit budgets argue less about "is this an acceptable level of disruption?" because the budget is the answer.

The trick is making the budget visible. Most teams compute the SLO but never display the remaining minutes. A dashboard tile that says "12 of 43 minutes burnt this month" is more actionable than 99.97% availability. Engineers who see the budget remaining make different choices about deploy timing, dependency upgrades, and risk tolerance.
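A minimal sketch of the query behind that tile, assuming a request-based SLI built from a hypothetical http_requests_total counter with a code label (the metric and rule names here are illustrative, not canonical): a recording rule that tracks what fraction of the 30-day budget has been spent, which a dashboard can multiply by 43.2 to display time-equivalent minutes burnt.

    groups:
      - name: error_budget
        rules:
          # Fraction of the 30-day error budget already spent: 0 = untouched, 1 = gone.
          # Error ratio over the SLO window divided by the allowed ratio (0.1% for a 99.9% SLO).
          - record: job:error_budget_spent:ratio_30d
            expr: |
              (
                sum(rate(http_requests_total{code=~"5.."}[30d]))
                /
                sum(rate(http_requests_total[30d]))
              ) / (1 - 0.999)

A 30-day range query is expensive to evaluate on every dashboard refresh, which is another reason to record it once on a slow evaluation interval rather than compute it ad hoc.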

Burn rate, in two sentences

Burn rate is how fast you are eating your budget relative to the steady-state pace. A burn rate of 1 means you finish the budget exactly at the end of the window. A burn rate of 14.4 means you exhaust your full 30-day budget in two days.

The 14.4 number is not arbitrary: it is the rate at which 2% of a 30-day budget disappears in a single hour, which is why the Google SRE Workbook uses it as the canonical "fast burn" paging threshold: severe enough to act on immediately, caught early enough that the team can mitigate before the budget is seriously dented. Teams that customise it often pick 10x or 20x; the exact number matters less than picking one value and tuning around it.

The intuition pump: if every minute of an incident consumes an hour's worth of budget at the steady-state pace, the burn rate is 60, and at that pace the 30-day budget is gone in twelve hours. Burn rate is the slope at which the budget line heads to zero.
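Numerically, burn rate is just the observed error ratio divided by the error fraction the SLO allows. A sketch in PromQL, reusing the hypothetical request counter from the previous snippet and a 99.9% target (this rule would slot into the same rule group):

    # Burn rate over the last 5 minutes, relative to the steady-state pace.
    # 1 = exactly on pace to exhaust the budget at the end of the window;
    # 14.4 = fast-burn territory.
    - record: job:burn_rate:ratio_rate5m
      expr: |
        (
          sum(rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) / (1 - 0.999)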

Why two windows beat one

Fast windows catch fast incidents but flap on every blip. Slow windows are stable but page hours after the user noticed. The trick is to require both: page only when the short window AND the long window agree. The short window says "right now is bad." The long window says "we are in a sustained problem, not a transient blip."

The mathematical justification: requiring both windows above a burn-rate threshold filters out transients. A 30-second flap might briefly hit 14.4x burn on the 5-minute window, but it can't hit 14.4x on the 1-hour window unless it lasts. The dual-window AND condition is a cheap, principled way to debounce.

The alternative single-window approaches all have failure modes. A pure short window flaps on every transient. A pure long window pages too late. Hysteresis (high-low thresholds) prevents flapping but introduces complexity and edge cases. Multi-window-multi-burn-rate is well-understood, mathematically grounded, and easier to reason about than ad-hoc filters.
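In PromQL the agreement requirement is literally the and operator between the two window series. A sketch of the fast-burn condition, assuming burn-rate recording rules like the one above exist for both windows (the full pairing follows in the next section):

    # Fire only when the 1-hour and the 5-minute burn rates both exceed 14.4x.
    job:burn_rate:ratio_rate1h > 14.4
    and
    job:burn_rate:ratio_rate5m > 14.4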

A working configuration

A common pair: page on 14.4x burn over 1h and 14.4x burn over 5m simultaneously (catches sudden severe incidents); separately page on 6x burn over 6h and 6x over 30m (catches slow leaks). Two rules, four expressions, no flapping.

Why 14.4x and 6x? 14.4x burn over 1 hour means you'd consume the entire 30-day budget in about 50 hours — a true emergency. 6x burn over 6 hours means you'd consume the budget in about 5 days — slow enough to investigate, fast enough that ignoring it loses you the budget by week's end. The two thresholds catch the two distinct incident shapes: sudden catastrophic and slow leak.

The Prometheus implementation is straightforward. A recording rule per window computes that window's burn rate; two alert rules combine them with AND. The full config is roughly 30 lines of YAML, including comments, and most teams paste it once and never touch it again.
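A sketch of that file, under the same assumptions as the earlier snippets (hypothetical metric and rule names, a 99.9% SLO over 30 days); treat it as a starting point rather than a canonical config:

    groups:
      - name: slo_burn_rate_records
        rules:
          # One burn-rate series per window: 5m, 30m, 1h, 6h.
          - record: job:burn_rate:ratio_rate5m
            expr: (sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) / (1 - 0.999)
          - record: job:burn_rate:ratio_rate30m
            expr: (sum(rate(http_requests_total{code=~"5.."}[30m])) / sum(rate(http_requests_total[30m]))) / (1 - 0.999)
          - record: job:burn_rate:ratio_rate1h
            expr: (sum(rate(http_requests_total{code=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)
          - record: job:burn_rate:ratio_rate6h
            expr: (sum(rate(http_requests_total{code=~"5.."}[6h])) / sum(rate(http_requests_total[6h]))) / (1 - 0.999)

      - name: slo_burn_rate_alerts
        rules:
          # Fast burn: sudden, severe incidents. Budget gone in about 50 hours at this pace.
          - alert: ErrorBudgetFastBurn
            expr: job:burn_rate:ratio_rate1h > 14.4 and job:burn_rate:ratio_rate5m > 14.4
            labels:
              severity: page
          # Slow burn: leaks that would drain the budget within about five days.
          - alert: ErrorBudgetSlowBurn
            expr: job:burn_rate:ratio_rate6h > 6 and job:burn_rate:ratio_rate30m > 6
            labels:
              severity: page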

How to tune it

Run the rules silent for two weeks against your real metrics. If they fire fewer than 5 times a week, ship them. More than that, your SLO is wrong, not the alerting math. Tighten the SLO before you loosen the alert.
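One way to run them silent, assuming Alertmanager handles your routing and using placeholder names for the label value and receiver: give the candidate alerts a non-paging severity and route it to a receiver with no notification integrations, so every firing lands in Alertmanager's history without waking anyone.

    # alertmanager.yml (fragment, merged into your existing route tree)
    route:
      routes:
        - matchers:
            - severity="shadow"
          receiver: blackhole
    receivers:
      - name: blackhole   # no notification configs: firings are recorded, nobody is paged

At the end of the trial, the firing history is visible in the Alertmanager UI, or via Prometheus's built-in ALERTS series restricted to the new alert names.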

The "tighten the SLO not the alert" rule is non-obvious. The instinct when an alert fires too often is to relax the alert thresholds. But the math of multi-window-multi-burn-rate is principled; the only way it fires "too often" is if the underlying SLO is unrealistic for your current service quality. The honest move is to admit the SLO doesn't match reality and adjust the SLO target.

What "tightening the SLO" means in practice: lower the availability target until your service actually meets it 95% of months, set the error budget accordingly, and let burn-rate alerting fire when you're at risk of NOT meeting that achievable target. A 99.5% SLO that's met is more useful than a 99.9% SLO that's missed every other month.

Common antipatterns

Single-window burn-rate alerts. Looks like burn-rate but isn't. Either flaps (short window only) or pages too late (long window only). Always pair short and long.

Different thresholds for different services. The team that customises 14.4x to 12x for one service and 18x for another loses the math's elegance and gains config drift. Use the canonical values; adjust the SLO instead.

Burn-rate alerts on metrics that aren't customer-facing. CPU burn rate, database burn rate. The framework is for SLI burn rates, not for symptom metrics. Symptoms get classic threshold alerts; SLIs get burn-rate alerts.

Skipping the silent-run period. Teams ship burn-rate alerts without testing them against historical data; the alerts fire 30 times the first week and the team disables them. The silent-run period is what catches calibration mistakes before the team loses trust.

What to do this week

Three concrete moves.

(1) Identify your top 3 customer-facing SLIs and their associated SLOs. If the SLOs are aspirational ("we'd like 99.99%") and not currently met, set realistic SLOs first; burn-rate alerting on aspirational SLOs is mathematical theatre.

(2) Implement the two-rule pair (14.4x over 1h+5m and 6x over 6h+30m) in silent mode for those 3 SLIs. Run it for 2 weeks.

(3) Promote to paging only the SLI/rule combinations that fire fewer than 5 times a week during the silent period. Anything firing more often signals an SLO mismatch; fix the SLO, then promote.