Error Budget Formulas Cheat Sheet
Every error-budget calculation with the PromQL beside it. Designed to live next to your alert rules file, not in a doc no one reads.
The core formulas
Five short formulas, that's it. Once these are computed, everything else is policy.
- Budget = 1 - SLO_target
- SLI = good_events / total_events over the SLO window
- Budget consumed = (1 - SLI) / (1 - SLO_target)
- Budget remaining = 1 - budget_consumed
- Burn rate = (error_rate_over_window) / (1 - SLO_target)
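As a quick sanity check, the formulas above can be sketched in plain Python. The function names and the sample request counts are illustrative assumptions, not part of any library; only the arithmetic comes from the list above.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to be bad, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def budget_consumed(good_events: int, total_events: int, slo_target: float) -> float:
    """Share of the budget already spent; 1.0 means fully spent."""
    sli = good_events / total_events
    return (1.0 - sli) / error_budget(slo_target)


def budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Share of the budget left; negative once the SLO is missed."""
    return 1.0 - budget_consumed(good_events, total_events, slo_target)


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_rate / error_budget(slo_target)


# Hypothetical numbers: 1,000,000 requests, 500 of them bad, 99.9% target.
print(round(budget_consumed(999_500, 1_000_000, 0.999), 4))
print(round(budget_remaining(999_500, 1_000_000, 0.999), 4))
# A 1.44% error rate against a 99.9% target.
print(round(burn_rate(0.0144, 0.999), 2))
```

With these sample numbers roughly half the budget is spent, and the 1.44% error rate works out to a burn rate of about 14.4.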
Budget remaining
The number on the dashboard. A negative value means the SLO for this window is already missed; pick the response from the policy thresholds below.
PromQL for a 99.9% availability SLO over 30 days:
- Bad rate over window: sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))
- Budget consumed: (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) / (1 - 0.999)
- Budget remaining: 1 - (above)
- Allowed bad events in window: (1 - 0.999) * sum(increase(http_requests_total[30d]))
- Bad events observed: sum(increase(http_requests_total{status=~"5.."}[30d]))
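The event-count versions of these queries reduce to simple arithmetic once the two totals are known. The sketch below assumes the counts have already been read from the queries above; the function names and the traffic figures are made up for illustration.

```python
def allowed_bad_events(total_events: int, slo_target: float) -> float:
    """Bad events the SLO tolerates for this much traffic."""
    return (1.0 - slo_target) * total_events


def budget_remaining_from_counts(bad_events: int, total_events: int,
                                 slo_target: float) -> float:
    """Remaining budget as a fraction; below zero means the SLO is missed."""
    return 1.0 - bad_events / allowed_bad_events(total_events, slo_target)


# Hypothetical 30-day window: 50,000,000 requests against a 99.9% target.
total = 50_000_000
print(round(allowed_bad_events(total, 0.999)))                       # allowed bad events
print(round(budget_remaining_from_counts(20_000, total, 0.999), 3))  # healthy window
print(round(budget_remaining_from_counts(65_000, total, 0.999), 3))  # overspent window
```

In the last case 65,000 bad events exceed the roughly 50,000 allowed, so the remaining budget comes out negative.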
Burn rate
How fast the budget is being spent right now. The reference table is fixed by SLO window length and is worth memorising.
- Burn rate 1, exhausts a 30d budget in 30 days (steady state)
- Burn rate 2, exhausts in 15 days
- Burn rate 6, exhausts in 5 days (slow-burn threshold)
- Burn rate 14.4, exhausts in 50 hours (fast-burn page; Google's recommendation)
- Burn rate 36, exhausts in 20 hours
- Burn rate 720, exhausts in 1 hour
PromQL: (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999)
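The reference table is just the window length divided by the burn rate, so it can be regenerated for any window. The sketch below uses hypothetical helper names; the second helper reflects the usual reading of the 14.4 figure as "2% of a 30-day budget spent in one hour", which is stated here as background rather than taken from the table itself.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return window_days * 24 / burn_rate


def burn_rate_for(budget_fraction: float, alert_window_hours: float,
                  window_days: int = 30) -> float:
    """Burn rate that spends budget_fraction of the budget in alert_window_hours."""
    return budget_fraction * window_days * 24 / alert_window_hours


# Rebuild the 30-day table from the list above.
for rate in (1, 2, 6, 14.4, 36, 720):
    print(rate, round(hours_to_exhaustion(rate), 1), "hours")

# Thresholds expressed as "share of budget spent within the alert window".
print(round(burn_rate_for(0.02, 1), 1))  # 2% of the budget in 1 hour
print(round(burn_rate_for(0.05, 6), 1))  # 5% of the budget in 6 hours
```

For a 7-day or 90-day SLO, pass a different window_days; the same burn rate then maps to a different time to exhaustion.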
Multi-window alerts
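Single-window burn alerts flap. Pair a long window (shows the burn is large enough to matter) with a short window (shows it is still happening). The alert fires only when both windows exceed the threshold, which removes most false pages and lets the alert clear soon after recovery.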
- Fast-burn page: 1h burn > 14.4 and 5m burn > 14.4. Pages within minutes of a real outage
- Slow-burn ticket: 6h burn > 6 and 30m burn > 6. Catches gradual degradations
- Steady warning: 24h burn > 3 and 2h burn > 3. Nice-to-know, no page
PromQL pattern (fast-burn):
(sum(rate(errors_total[1h])) / sum(rate(requests_total[1h])) / 0.001 > 14.4) and (sum(rate(errors_total[5m])) / sum(rate(requests_total[5m])) / 0.001 > 14.4)
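The "both windows must exceed the threshold" rule can be checked offline with a small sketch like the one below. The BurnAlert class, the ALERTS list, and the sample burn values are illustrative assumptions; the window pairs and thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    long_window: str
    short_window: str
    threshold: float


ALERTS = [
    BurnAlert("fast-burn page", "1h", "5m", 14.4),
    BurnAlert("slow-burn ticket", "6h", "30m", 6.0),
    BurnAlert("steady warning", "24h", "2h", 3.0),
]


def firing(burn_by_window: dict) -> list:
    """Names of alerts whose long AND short windows both exceed the threshold."""
    return [
        alert.name
        for alert in ALERTS
        if burn_by_window[alert.long_window] > alert.threshold
        and burn_by_window[alert.short_window] > alert.threshold
    ]


# Hypothetical burn rates a few minutes into an outage.
during = {"5m": 40.0, "30m": 9.0, "1h": 16.0, "2h": 4.0, "6h": 2.5, "24h": 1.2}
# The same service shortly after recovery: long windows still look bad.
after = {"5m": 0.4, "30m": 7.0, "1h": 15.0, "2h": 4.0, "6h": 2.6, "24h": 1.3}

print(firing(during))
print(firing(after))
```

In the second sample the 1h window is still above 14.4, but the 5m window has dropped, so the page clears instead of lingering for the rest of the hour.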
Latency budgets
Treat slow as bad. Define a "good" event as latency ≤ threshold, and the same formulas apply.
- Good events (at or under 250ms): sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m]))
- Total: sum(rate(http_request_duration_seconds_count[5m]))
- SLI: good / total
- Budget consumed (for a 95% latency target): (1 - good/total) / (1 - 0.95)
- Tip: include errors as bad events too; a 500 in 50ms isn't a "good" request
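The tip about errors is easy to get wrong, so here is a minimal sketch of the classification on raw request records. The (status, latency) tuple shape, the function names, and the sample data are assumptions for illustration; the 250ms threshold and 95% target match the list above.

```python
def is_good(status: int, latency_seconds: float,
            threshold_seconds: float = 0.25) -> bool:
    """Good means fast AND not a server error."""
    return status < 500 and latency_seconds <= threshold_seconds


def latency_budget_consumed(requests, slo_target: float = 0.95,
                            threshold_seconds: float = 0.25) -> float:
    """requests is a list of (status, latency_seconds) tuples."""
    total = len(requests)
    good = sum(
        1 for status, latency in requests
        if is_good(status, latency, threshold_seconds)
    )
    return (1 - good / total) / (1 - slo_target)


# Hypothetical sample: 18 fast successes, 1 slow success, 1 fast 500.
sample = [(200, 0.10)] * 18 + [(200, 0.40), (500, 0.05)]
print(is_good(500, 0.05))                         # fast, but still bad
print(round(latency_budget_consumed(sample), 2))  # 2 bad out of 20 at a 95% target
```

Two bad requests out of twenty is a 90% SLI, which is twice the error rate a 95% target allows.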
Policy thresholds
The numbers turn into action. Make these explicit so on-call doesn't have to negotiate them at 3am.
- Budget > 50% remaining, ship freely, run experiments, tolerate canary failures
- Budget 25-50%, normal review, no risky one-shots
- Budget 0-25%, slow rollouts, mandatory canary, change advisory
- Budget exhausted, feature freeze; reliability work only until next window
- Budget exhausted in two consecutive windows, escalate a review of the SLO target itself; either the SLO is wrong or the system needs investment
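Writing the thresholds down as code is one way to make them non-negotiable. The sketch below is one possible encoding, not a prescribed one: the function names are hypothetical, and the handling of the exact 25% and 50% boundaries is an assumption, since the list above does not say which side they fall on.

```python
def release_policy(budget_remaining: float) -> str:
    """Map remaining budget (a fraction, may be negative) to a policy tier."""
    if budget_remaining > 0.50:
        return "ship freely"
    if budget_remaining >= 0.25:  # assumption: exactly 25% counts as normal review
        return "normal review"
    if budget_remaining > 0:
        return "slow rollouts, mandatory canary"
    return "feature freeze"


def should_review_slo(exhausted_by_window: list) -> bool:
    """True when the two most recent windows both ended with the budget exhausted."""
    return len(exhausted_by_window) >= 2 and all(exhausted_by_window[-2:])


for remaining in (0.8, 0.4, 0.1, -0.3):
    print(remaining, "->", release_policy(remaining))

# Oldest window first: fine, exhausted, exhausted.
print(should_review_slo([False, True, True]))
```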