By Samson Tanimawo, PhD. Published Dec 19, 2026.

Error Budget Burn-Rate Alerts: The Math Behind Modern SLOs

If your alerts still fire at "error rate above 1%", you are alerting on weather, not climate. Multi-window burn-rate alerts are the upgrade, and they are worth the math.

The threshold-alert problem

"Page when error rate exceeds 1% for 5 minutes." Every team has written this rule. It alerts on transient blips that recover before anyone reads the page, and it stays silent during a slow week-long erosion that consumes the entire month's error budget. The threshold has no relationship to your SLO.

The diagnosis. Threshold rules answer "is the system experiencing errors right now?" That is not the question that should wake you up. The question that should wake you up is "are we burning the error budget faster than the SLO can absorb?" Different math, different alert.

Burn rate, defined

Your SLO is a budget over a window: say, 99.9% success over 30 days. The error budget is what you can spend: 0.1% of 30 days (43,200 minutes) = 43.2 minutes of full downtime a month, or the equivalent in failed requests. Burn rate is the speed at which you are spending it.

A burn rate of 1.0 means "consuming budget at exactly the rate that hits zero by end-of-window." A burn rate of 14.4 means "at this pace the entire monthly budget is gone in about two days (720 h / 14.4 ≈ 50 hours)." A burn rate of 0.1 means "spending one-tenth the allowed pace; everything is fine."

The formula. Burn rate = (current error rate) / (1 - SLO target). For a 99.9% SLO, burn rate = error_rate / 0.001. A 1% error rate is a burn rate of 10, fast.
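In PromQL, the same formula is the observed error ratio divided by the budget fraction. A sketch for the 99.9% SLO, assuming the http_requests_total counter from the example below:

# Burn rate over the last hour: 1.0 = sustainable pace, 10 = a 1% error rate.
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) / 0.001

Plot this one series and the thresholds in the next section read directly off the graph.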

The fast-burn / slow-burn pair

Two alerts, two characters. Fast-burn catches outages: at 14.4x or worse, the monthly budget is gone in two days or less. Threshold: burn rate > 14.4 over a 1-hour window; page someone immediately. Slow-burn catches erosion: at 6x, the budget is gone in about five days. Threshold: burn rate > 6 over a 6-hour window; file a ticket, no page.

The 14.4 number is not arbitrary. It is the rate that consumes 2% of a 30-day budget in 1 hour, which is roughly the threshold above which most teams agree "this is an outage." The 6.0 number consumes 5% in 6 hours, the rate at which you should know about it before the weekend.
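The general recipe, for a 30-day (720-hour) window: a burn rate that spends a fraction f of the budget in t hours is f × 720 / t. Both magic numbers fall out of it:

burn_rate = f × 720 h / t
fast-burn: 0.02 × 720 / 1 = 14.4
slow-burn: 0.05 × 720 / 6 = 6.0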

The point of two alerts. Fast-burn protects against acute incidents. Slow-burn protects against the silent week of slightly-elevated 5xx that consumes the whole budget. Threshold-only alerts catch one and miss the other.

Multi-window confirmation

The honest improvement on the basic burn-rate alert is the multi-window AND. Fire only when both a long window (e.g. 1h) and a short window (e.g. 5m) cross the threshold. The long window confirms "this is real, not a blip." The short window confirms "this is happening right now, not a flapping artifact."

Without multi-window, your fast-burn alert fires every time a single backend hiccups for 30 seconds. With it, the alert is high-signal: every page corresponds to something a human should look at.

A copy-paste Prometheus example

Assuming an SLO of 99.9% over 30 days, here are the fast-burn and slow-burn alerts, ready for a Prometheus rules file (Alertmanager handles the routing).

# Fast-burn: 14.4x for 1h, confirmed by 14.4x for 5m
- alert: SLOFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "SLO fast burn, 2% of monthly budget in 1h"

# Slow-burn: 6x for 6h, confirmed by 6x for 30m
- alert: SLOSlowBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[6h]))
      / sum(rate(http_requests_total[6h]))
    ) > (6 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[30m]))
      / sum(rate(http_requests_total[30m]))
    ) > (6 * 0.001)
  for: 15m
  labels:
    severity: ticket
  annotations:
    summary: "SLO slow burn, 5% of monthly budget in 6h"

Replace 0.001 with (1 - your-slo-target). Replace the metric name with whatever you actually emit. The structure stays the same.
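If writing the same ratio four times bothers you, recording rules keep the alert expressions short. A sketch; the slo:... names are illustrative, not a Prometheus requirement:

# Precompute the error ratio per window; the 1h and 5m pair is shown here,
# the 6h and 30m pair follows the same pattern.
- record: slo:http_error_ratio:rate1h
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
- record: slo:http_error_ratio:rate5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

The fast-burn expr then reads: slo:http_error_ratio:rate1h > (14.4 * 0.001) and slo:http_error_ratio:rate5m > (14.4 * 0.001).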

Antipatterns

One alert per service, no SLO. Burn-rate math requires an SLO. Without one, you are still doing thresholds. Define the SLO first, then the alert.

Alerting on burn rate without acting on it. The point of burn-rate alerts is to make the budget visible enough that the team negotiates feature work vs reliability work. If you fire alerts and ignore them, the math has not helped.

Excluding planned-outage minutes from the budget. Tempting; corrosive. Users do not care that your maintenance was planned. Include the time; lower the SLO if needed.

What to do this week

Three moves. (1) Pick one critical service; write a 30-day SLO target you can defend. (2) Replace the threshold alert on that service with the fast-burn / slow-burn pair above. (3) Add a budget-remaining graph to the on-call dashboard (one candidate query below) so the question "do we have budget for risky work" has an instant answer.
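For move (3), one candidate query, assuming the 99.9%/30-day SLO and the same counter as above; the 30-day range is expensive to evaluate on the fly, so a recording rule is worth it here too:

# Fraction of the 30-day error budget still unspent: 1.0 = untouched, 0 = gone.
1 - (
  (
    sum(increase(http_requests_total{status=~"5.."}[30d]))
    / sum(increase(http_requests_total[30d]))
  ) / 0.001
)

When this dips below your comfort line, risky launches wait.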