Error Budgets Explained: From Theory to Real Team Use
An error budget is simple math; it's the organisational behaviour around it that's hard. Here's what changes in a team that actually uses one.
The math in 30 seconds
If your SLO is 99.9% over 28 days, your error budget is 0.1% of requests in that window. For a service doing 10M requests/28 days, that is 10,000 “allowed” failures before you blow the budget.
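The arithmetic above can be sketched in a few lines of Python. The function names `error_budget` and `budget_remaining` are illustrative, not from any library, and the 2,500 failed requests in the usage lines are an invented figure for the example.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Return the number of failed requests the SLO tolerates in one window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # round() absorbs floating-point noise such as 10000.000000000009
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("budget is zero; the window has too few requests")
    return 1.0 - failed_requests / budget


# 99.9% SLO over a window of 10M requests, as in the example above
allowed = error_budget(0.999, 10_000_000)
# Hypothetical: 2,500 failures so far in the window
remaining = budget_remaining(0.999, 10_000_000, 2_500)

print(allowed)             # 10000
print(f"{remaining:.0%}")  # 75%
```

Expressing the remainder as a fraction rather than a raw count makes it easy to compare against percentage thresholds later.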
Spend too little and you're being overly cautious: you could have shipped faster. Spend too much and reliability drops below the promise you made to the business. A healthy budget gets used.
What a budget actually buys you
The error budget is not a stick for punishing the team. It is a permission slip. As long as you are under budget, you can:
- Run chaos experiments in production
- Do aggressive deploys (canary jumps, region moves, infra swaps)
- Refactor a risky service without weeks of prep
When the budget runs out, those activities pause until the next window, freeing the team to focus on fixing the reliability gap instead.
Writing an error-budget policy
The policy is a document, signed by engineering and product, that says “here is exactly what happens at each budget level.” A reasonable starter template:
- >50% remaining: normal deploys, feature work, experiments allowed.
- 25-50% remaining: require a second reviewer on risky changes; defer non-critical experiments.
- <25% remaining: feature freeze. Only reliability-improving changes deploy. Daily review of the burn rate.
- Budget exhausted: stop deploys except for security/rollback. Engineering leads schedule a reliability sprint.
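One way to keep a policy like this from drifting into "advisory" is to encode the tiers so that dashboards and deploy tooling read the same thresholds. The sketch below is a minimal illustration of the starter template, not a prescribed implementation. The function name `policy_tier` and the short tier labels are assumptions, and a remaining fraction of exactly 50% or 25% is treated as belonging to the middle tier.

```python
def policy_tier(remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = untouched) to a policy tier."""
    if remaining <= 0.0:
        return "exhausted: stop deploys except security fixes and rollbacks"
    if remaining < 0.25:
        return "freeze: only reliability-improving changes deploy"
    if remaining <= 0.50:
        return "caution: second reviewer on risky changes, defer experiments"
    return "normal: deploys, feature work, and experiments allowed"


# Print the tier for a few sample budget levels
for level in (0.80, 0.40, 0.10, 0.0):
    print(f"{level:.0%} remaining -> {policy_tier(level)}")
```

A deploy pipeline could call a function like this before a rollout and refuse to proceed in the "freeze" or "exhausted" tiers unless the change is flagged as a reliability fix.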
Burn-rate alerts, not state alerts
Alerting on “SLO violated” is too late: by then the budget is already gone. Alert on burn rate instead, meaning how fast you're consuming the budget.
Two alerts cover nearly every case: a fast-burn alert (2% of the monthly budget consumed in 1 hour) and a slow-burn alert (5% of the monthly budget consumed in 6 hours). The first catches spikes; the second catches slower regressions that would never trip the one-hour alert.
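To turn a statement like "X% of the budget in Y hours" into something an alerting system can evaluate, it helps to convert it to a burn rate: a multiple of the pace that would spend exactly the whole budget over the SLO window. The sketch below assumes the 28-day window and 99.9% SLO from the earlier example. The function names are illustrative, not part of any monitoring tool.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 28) -> float:
    """Burn rate implied by spending `budget_fraction` of the budget in the alert window.

    A burn rate of 1.0 spends the whole budget in exactly one SLO window.
    """
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(rate: float, slo: float = 0.999) -> float:
    """Error ratio that, if sustained over the alert window, matches the given burn rate."""
    return rate * (1.0 - slo)


# Fast-burn alert: 2% of the budget consumed in 1 hour
fast = burn_rate(0.02, 1)
print(round(fast, 2))                        # 13.44
print(f"{error_rate_threshold(fast):.4%}")   # 1.3440%
```

With a 30-day window instead of 28, the same fast-burn condition works out to a burn rate of 14.4. The alert then fires when the observed error ratio over the last hour exceeds the computed threshold.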
The part that is really about politics
The math is easy. The hard part is convincing product leadership that a feature freeze 80% of the way through the month is the right response. That conversation is much easier when the policy was signed before the budget ran out, not after.
Teams that succeed with error budgets treat the policy as a pre-commitment. Teams that fail treat it as advisory.
How to run the first month
Start the month with a clean budget. Post it in a Slack channel with the on-call engineers. Watch the burn rate, not the remaining percentage.
If the budget ends the month with more than 60% remaining, either your target is too loose or you are underinvesting in shipping speed. If it ends with under 20% remaining, either your target is too tight or you are underinvesting in reliability. Adjust the target quarterly, not monthly.
The first time the budget runs out, read the policy aloud at the next all-hands. Seeing product and engineering agree, in public, that the freeze is happening is what sets the precedent for every quarter after.