Error Budget Policies That Actually Get Followed
Error budgets only work if the consequence of exhausting them is real. The discipline is political; the policy is structural.
Why most policies are decoration
Most error budget policies say ‘if budget exhausted, freeze feature work.’ In practice, feature work continues. A policy survives only if triggering it actually costs feature shipping.
- Stated rule. Out-of-budget services freeze non-reliability work; reads as a hard constraint.
- Actual behaviour. Feature shipping continues; the budget becomes advisory; the policy becomes decoration.
- Why it dies. No mechanical enforcement; product pressure beats SRE preference; nobody wants to be the bad guy.
- What survives. Policies with structural enforcement: CI gating, sprint planning gates, leadership scorecards.
Three enforcement mechanisms
- 1. CI gating. Out-of-budget services cannot deploy non-reliability changes.
- 2. Sprint planning gating. Affected team plans only reliability work that sprint.
- 3. Public scorecard. Budget status visible to leadership.
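The CI gate is the most mechanical of the three, so it is the easiest to sketch. The following is a minimal illustration, not a reference implementation: `BudgetStatus`, `deploy_allowed`, the `"reliability"` label, and the event-count definition of the budget are all assumptions made for this example. Your SLO tooling and change-labelling scheme will differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    """Hypothetical snapshot of one service's error budget for the SLO window."""
    slo_target: float    # e.g. 0.999 means 99.9% of events must succeed
    total_events: int    # events observed in the window
    failed_events: int   # events that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def exhausted(self) -> bool:
        return self.failed_events >= self.allowed_failures


def deploy_allowed(status: BudgetStatus,
                   change_labels: set[str],
                   has_exception: bool = False) -> bool:
    """Gate decision: in-budget services deploy freely; out-of-budget
    services deploy only reliability work or documented exceptions."""
    if not status.exhausted:
        return True
    if has_exception:
        return True
    # "reliability" is an assumed label name for this sketch.
    return "reliability" in change_labels


if __name__ == "__main__":
    over = BudgetStatus(slo_target=0.999, total_events=1_000_000, failed_events=1_500)
    print(deploy_allowed(over, {"feature"}))       # blocked
    print(deploy_allowed(over, {"reliability"}))   # permitted
```

A CI job would call something like `deploy_allowed` before the deploy step and fail the pipeline when it returns `False`. The point is that the decision is made by the pipeline, not by a person who has to say no.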
Getting product buy-in
Product accepts the policy because the alternative is worse: incidents that block features anyway, less predictably. Frame the budget as feature-shipping insurance, not as a brake.
- The framing. Budget is insurance; spending it on incidents costs more feature time than spending it on reliability work.
- The data. Show prior quarters; services that stayed within budget shipped more features than services that exceeded it.
- The trade. Product accepts predictable freezes over unpredictable incident response; engineering accepts measurable scope.
- Joint authorship. Product co-authors the policy; ownership beats edict; co-signed policies survive personnel changes.
The exception safety valve
Exceptions exist for legitimate cases: regulatory deadlines, customer contracts, security patches. Document them; cap their use; without exceptions, the policy becomes brittle and gets ignored entirely.
- When exceptions apply. Regulatory deadline, signed customer contract, critical security patch; the bar is high and named.
- The cap. Maximum N exceptions per year; once exhausted, the policy is firm again.
- Documentation. Each exception logged with reason, approver, and post-hoc review at quarter end.
- The signal. Exceptions every sprint means the SLO is wrong; rewrite the SLO before the policy collapses.
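The cap and the documentation requirement can live in one small ledger. The sketch below assumes a yearly cap and three named reasons taken from the list above; `ExceptionLedger`, `ExceptionEntry`, and the reason strings are hypothetical names for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date

# Assumed identifiers for the three legitimate cases named in the policy.
ALLOWED_REASONS = {"regulatory_deadline", "customer_contract", "security_patch"}


@dataclass(frozen=True)
class ExceptionEntry:
    reason: str
    approver: str
    granted_on: date


@dataclass
class ExceptionLedger:
    """Hypothetical log of policy exceptions with a per-year cap."""
    max_per_year: int
    entries: list[ExceptionEntry] = field(default_factory=list)

    def used_in(self, year: int) -> int:
        return sum(1 for e in self.entries if e.granted_on.year == year)

    def grant(self, reason: str, approver: str, granted_on: date) -> bool:
        """Record an exception if the reason is named and the cap allows it.

        Returns False once the yearly cap is spent, so the policy is firm again.
        """
        if reason not in ALLOWED_REASONS:
            raise ValueError(f"not a recognised exception reason: {reason!r}")
        if self.used_in(granted_on.year) >= self.max_per_year:
            return False
        self.entries.append(ExceptionEntry(reason, approver, granted_on))
        return True
```

The stored entries double as the input for the quarter-end review: every exception carries its reason, approver, and date.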
Antipatterns
- Policy with no enforcement. Decoration.
- No exception path. Brittle.
- Exceptions every sprint. Policy gone.
What to do this week
Three moves. (1) Apply one enforcement mechanism to your highest-impact service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap between policy and practice persists.
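One possible way to put a number on adherence, assuming you can export a deploy log for the 30-day window: count the deploys made while the budget was exhausted and ask what share of them were reliability work or covered by a documented exception. The `Deploy` record and the `adherence` function are illustrative assumptions, not a standard metric.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Deploy:
    """Hypothetical deploy-log record for the measurement window."""
    service: str
    budget_exhausted: bool      # was the service out of budget at deploy time?
    is_reliability: bool        # was the change reliability work?
    has_exception: bool = False # was it covered by a logged exception?


def adherence(deploys: Iterable[Deploy]) -> Optional[float]:
    """Share of out-of-budget deploys that complied with the policy.

    Returns None when no deploy happened while out of budget, because
    the policy was never tested in that window.
    """
    gated = [d for d in deploys if d.budget_exhausted]
    if not gated:
        return None
    compliant = sum(1 for d in gated if d.is_reliability or d.has_exception)
    return compliant / len(gated)


if __name__ == "__main__":
    log = [
        Deploy("checkout", budget_exhausted=True, is_reliability=True),
        Deploy("checkout", budget_exhausted=True, is_reliability=False),
        Deploy("checkout", budget_exhausted=False, is_reliability=False),
    ]
    print(adherence(log))
```

A value well below 1.0 that persists across the window is the durable gap that move (3) refers to.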