Error Budget Formulas Cheat Sheet

Every error-budget calculation with the PromQL beside it. Designed to live next to your alert rules file, not in a doc no one reads.

The core formulas

Five formulas, that's it. Once these are computed, everything else is policy.

Budget = 1 - SLO_target
SLI = good_events / total_events over the SLO window
Budget consumed = (1 - SLI) / (1 - SLO_target)
Budget remaining = 1 - budget_consumed
Burn rate = (error_rate_over_window) / (1 - SLO_target)
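
A worked example can make the formulas above concrete. The following is a minimal Python sketch, not part of any alerting tool; the function names and the request counts (1,000,000 requests, 400 bad) are hypothetical and chosen only for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Budget = 1 - SLO_target (0.001 for a 99.9% SLO)."""
    return 1.0 - slo_target


def sli(good_events: float, total_events: float) -> float:
    """SLI = good_events / total_events over the SLO window."""
    return good_events / total_events


def budget_consumed(sli_value: float, slo_target: float) -> float:
    """Budget consumed = (1 - SLI) / (1 - SLO_target)."""
    return (1.0 - sli_value) / error_budget(slo_target)


def budget_remaining(sli_value: float, slo_target: float) -> float:
    """Budget remaining = 1 - budget_consumed."""
    return 1.0 - budget_consumed(sli_value, slo_target)


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = error_rate_over_window / (1 - SLO_target)."""
    return error_rate / error_budget(slo_target)


# Hypothetical window: 1,000,000 requests, 400 of them bad, 99.9% SLO.
window_sli = sli(good_events=999_600, total_events=1_000_000)
consumed = budget_consumed(window_sli, slo_target=0.999)
remaining = budget_remaining(window_sli, slo_target=0.999)
print(f"SLI={window_sli:.4%} consumed={consumed:.1%} remaining={remaining:.1%}")
```

With these made-up counts the bad fraction is 0.04% against a 0.1% budget, so roughly 40% of the budget is consumed and 60% remains.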

Budget remaining

The number on the dashboard. Negative means the SLO is already missed for this window; pick the response from the policy thresholds below.

PromQL for a 99.9% availability SLO over 30 days:

Bad rate over window: sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
Budget consumed: (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)
Budget remaining: 1 - (above)
Allowed bad events in window: (1 - 0.999) * sum(increase(http_requests_total[30d]))
Bad events observed: sum(increase(http_requests_total{status=~"5.."}[30d]))
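
The same budget can be read in raw event counts, which is often easier to reason about than ratios. This is a small Python sketch of that arithmetic; the function name and the traffic figures (50M requests, 65,000 of them 5xx) are hypothetical and stand in for the two increase() results above.

```python
def budget_in_events(total_events: int, bad_events: int, slo_target: float):
    """Return (allowed bad events, bad events left, remaining budget fraction)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    events_left = allowed_bad - bad_events
    # Negative when more bad events were observed than the budget allows.
    remaining_fraction = events_left / allowed_bad
    return allowed_bad, events_left, remaining_fraction


# Hypothetical 30d window: 50M requests, 65,000 of them 5xx, 99.9% SLO.
allowed, left, fraction = budget_in_events(50_000_000, 65_000, 0.999)
print(f"allowed={allowed:,.0f} left={left:,.0f} remaining={fraction:.0%}")
```

Here about 50,000 bad events were allowed and 65,000 occurred, so the remaining budget comes out near -30%: the negative case described above.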

Burn rate

How fast the budget is being spent right now. Time to exhaustion is the SLO window divided by the burn rate, so the reference table is fixed by SLO window length (30 days here) and is worth memorising.

Burn rate 1: exhausts a 30d budget in 30 days (steady state)
Burn rate 2: exhausts in 15 days
Burn rate 6: exhausts in 5 days (slow-burn threshold)
Burn rate 14.4: exhausts in 50 hours (fast-burn page; Google's recommendation)
Burn rate 36: exhausts in 20 hours
Burn rate 720: exhausts in 1 hour

PromQL: (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)
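
The reference table can be regenerated for any window length, since time to exhaustion is just the window divided by the burn rate. A short Python sketch, with a hypothetical helper name and the 30-day window assumed as the default:

```python
def hours_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Hours until a full budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24.0 / burn_rate


# Same burn rates as the reference table, for a 30-day window.
for rate in (1, 2, 6, 14.4, 36, 720):
    hours = hours_to_exhaustion(rate)
    print(f"burn rate {rate:>5}: {hours:7.1f} h ({hours / 24:.2f} days)")
```

Passing window_days=7 or window_days=28 gives the equivalent table for a shorter SLO window.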

Multi-window alerts

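Single-window burn alerts flap. Pair a long window (shows the burn is sustained enough to matter) with a short window (confirms it is still happening). The alert fires only when both windows exceed the threshold, which kills about 90% of false pages and lets the alert clear soon after recovery.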

Fast-burn page: 1h burn > 14.4 and 5m burn > 14.4. Pages within minutes of a real outage
Slow-burn ticket: 6h burn > 6 and 30m burn > 6. Catches gradual degradations
Steady warning: 24h burn > 3 and 2h burn > 3. Nice-to-know, no page

PromQL pattern (fast-burn):

(sum(rate(errors_total[1h])) / sum(rate(requests_total[1h])) / 0.001 > 14.4)
and (sum(rate(errors_total[5m])) / sum(rate(requests_total[5m])) / 0.001 > 14.4)
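
The both-windows rule can be checked offline against burn rates read from a dashboard. The sketch below is a plain Python model of the three tiers above, not a Prometheus feature; the BurnAlert class, the firing() helper, and the sample burn rates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window: str
    short_window: str
    threshold: float


# The three tiers listed above.
ALERTS = (
    BurnAlert("fast-burn page", "1h", "5m", 14.4),
    BurnAlert("slow-burn ticket", "6h", "30m", 6.0),
    BurnAlert("steady warning", "24h", "2h", 3.0),
)


def firing(burn_by_window: dict) -> list:
    """Names of alerts whose long AND short window both exceed the threshold."""
    return [
        alert.name
        for alert in ALERTS
        if burn_by_window[alert.long_window] > alert.threshold
        and burn_by_window[alert.short_window] > alert.threshold
    ]


# Hypothetical readings shortly after an outage ended: the 1h window is
# still hot, but the 5m window has recovered, so the page does not fire.
recovering = {"5m": 0.5, "30m": 8.0, "1h": 20.0, "2h": 12.0, "6h": 7.0, "24h": 3.5}
print(firing(recovering))
```

In this made-up reading the fast-burn page stays quiet because the 5m burn has dropped, while the slower tiers still report the damage already done.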

Latency budgets

Treat slow as bad. Define a "good" event as latency ≤ threshold, and the same formulas apply.

Good events (under 250ms): sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m]))
Total: sum(rate(http_request_duration_seconds_count[5m]))
SLI: good / total
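Budget consumed (95% latency SLO): (1 - good/total) / (1 - 0.95), with good and total taken over the SLO window; over [5m] as above, the same expression is the 5m burn rate
Tip: include errors as bad events too; a 500 in 50ms isn't a "good" request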
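
To see how much the tip about errors matters, the latency SLI can be computed with and without fast errors counted as good. A minimal Python sketch under assumed numbers: the function names, the counts (10,000 requests, 9,700 under 250ms, 100 of those 5xx) and the 95% target are hypothetical.

```python
def latency_sli(under_threshold: float, fast_errors: float, total: float) -> float:
    """Share of requests that were both fast and successful."""
    good = under_threshold - fast_errors  # a fast 5xx is not a good event
    return good / total


def latency_budget_consumed(sli_value: float, slo_target: float = 0.95) -> float:
    """(1 - SLI) / (1 - SLO_target), same shape as the availability formula."""
    return (1.0 - sli_value) / (1.0 - slo_target)


# Hypothetical window: 10,000 requests, 9,700 under 250ms, 100 of those were 5xx.
naive = latency_sli(9_700, 0, 10_000)      # errors wrongly counted as good
strict = latency_sli(9_700, 100, 10_000)   # errors counted as bad
print(f"naive:  SLI={naive:.2%} consumed={latency_budget_consumed(naive):.0%}")
print(f"strict: SLI={strict:.2%} consumed={latency_budget_consumed(strict):.0%}")
```

With these figures, counting fast errors as bad moves budget consumption from about 60% to about 80%.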

Policy thresholds

The numbers turn into action. Make these explicit so on-call doesn't have to negotiate them at 3am.

Budget > 50% remaining: ship freely, run experiments, tolerate canary failures
Budget 25-50%: normal review, no risky one-shots
Budget 0-25%: slow rollouts, mandatory canary, change advisory
Budget exhausted: feature freeze; reliability work only until next window
Budget exhausted twice in a row: revisit the SLO target itself; either the SLO is wrong or the system needs investment
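
Encoding the thresholds as a lookup keeps the policy out of anyone's memory at 3am. This is a hypothetical Python sketch, not an existing tool; the function name is invented, and the handling of the exact boundaries (25% falls in the normal-review band, 0% counts as exhausted) is an assumption the list above does not settle.

```python
def release_policy(budget_remaining: float, exhausted_last_window: bool = False) -> str:
    """Map remaining budget (a fraction, may be negative) to a policy tier."""
    if budget_remaining <= 0:  # assumption: exactly 0 counts as exhausted
        if exhausted_last_window:
            return "revisit the SLO target"
        return "feature freeze; reliability work only"
    if budget_remaining < 0.25:  # assumption: exactly 25% is normal review
        return "slow rollouts, mandatory canary, change advisory"
    if budget_remaining <= 0.50:
        return "normal review, no risky one-shots"
    return "ship freely"


# Hypothetical dashboard readings.
for remaining in (0.80, 0.40, 0.10, -0.30):
    print(f"{remaining:>6.0%}: {release_policy(remaining)}")
print(release_policy(-0.30, exhausted_last_window=True))
```

Whatever boundaries a team picks, writing them down in one place like this removes the ambiguity at the band edges.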