Error Budget Policy
What happens when the budget is exhausted.
Trigger
An error budget without a policy is a number on a dashboard. The policy is what turns the number into governance: specific consequences when the budget burns, agreed to in advance by the people who will have to honor them. The policy is what gives the SLO teeth.
What a triggered policy actually does:
- Budget exhausted: feature freeze. The team stops shipping new features to production until the budget recovers. Bug fixes, security patches, and reliability work continue. Feature work scheduled for this sprint moves to the next sprint or gets dropped.
- Reliability work prioritized. The capacity that was going to feature work redirects to reliability investments: closing the gaps that caused the burn, hardening dependencies, fixing the tests that did not catch the regression. The list comes from the incident retros and the burn attribution data.
- Heightened deploy gates. Any deploy that does happen during the freeze (reliability fixes, security patches) requires additional approval, typically from the SRE lead or engineering lead. The default deploy path becomes manually approved instead of auto-promoted.
- Stakeholder notification. Product and leadership are told the freeze is in effect with an estimated lift date. The notification is a fact, not a request for permission. The policy was agreed to in advance; the notification is communicating the trigger, not negotiating the response.
- Documented retro on the trigger event. The incident or pattern that caused the burn gets a written retro within a week. The retro covers what changed, why the existing safeguards did not catch it, and what structural change will close the gap.
The policy works only when the trigger is automatic. A policy that requires a meeting to decide whether to enforce it is not a policy; it is a suggestion. Triggers fire on the data, not on judgment.
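As a minimal sketch of what "fires on the data" can look like, the trigger can be written as a pure function of the window's request counts. The `SloWindow` shape, the function names, and the request-based availability SLO are all assumptions made for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (hypothetical shape)."""
    slo_target: float     # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int
    failed_requests: int


def budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget left.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    return 1.0 - window.failed_requests / allowed_failures


def freeze_triggered(window: SloWindow) -> bool:
    """Decide from the data alone: no meeting and no override flag."""
    return budget_remaining(window) <= 0.0
```

With a 99.9% target and 1,000,000 requests, the budget is about 1,000 failed requests: 400 failures leave roughly 60% remaining, and 1,200 failures trip the freeze. The function deliberately takes no "exception" or "approved by" argument, which is the point of an automatic trigger.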
Recovery
The recovery side is just as important as the trigger. A policy with no defined exit becomes either permanent (which kills velocity) or quietly violated (which kills the policy). The recovery rules say when the freeze lifts and what conditions must be met.
- Budget recovers naturally over the SLO window. The error budget is computed over the SLO window (28 days, a calendar month, a quarter). With a rolling window, old incidents fall out as the window slides and the budget refills. Budget consumed by an incident comes back when that incident ages out, so with a 28-day window a team at 0% remaining recovers everything its past incidents consumed within four weeks, provided no new incidents land in that period.
- Or via reliability fixes. If the cause of the burn was a structural issue (a missing test, an underprovisioned dependency, a broken alert), shipping the fix accelerates recovery: expected burn over the next four weeks drops because the issue is no longer producing incidents.
- Lift threshold defined. The freeze lifts when the budget returns to a defined threshold (typically 50% remaining). The lift is automatic, just like the trigger was. No special meeting, no judgment call. The data hits the threshold; the policy lifts.
- Document what was learned. A short writeup, public to engineering, summarizing what triggered the freeze, what changed during it, and what the team did differently when it lifted. This is how the policy becomes a learning tool, not a punishment.
- Adjust the SLO if necessary. If the freeze fired three times in a year, the SLO target may be wrong. Either the architecture cannot support the committed target, or the operational investment level is too low. The time to renegotiate the target is during recovery, not at the moment the trigger fires.
The recovery rules give the team a clear path back to feature work. They prevent the freeze from becoming permanent and they prevent the team from learning that the policy is optional.
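The recovery rules can be sketched the same way. The example below assumes a rolling 28-day window, a request-based SLO, and a 50% lift threshold; `DailyCount`, `budget_remaining_on`, and `next_freeze_state` are hypothetical names chosen for illustration, not part of any specific product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable


@dataclass(frozen=True)
class DailyCount:
    """One day of request counts (hypothetical shape)."""
    day: date
    total: int
    failed: int


def budget_remaining_on(as_of: date, history: Iterable[DailyCount],
                        slo_target: float = 0.999,
                        window_days: int = 28) -> float:
    """Budget fraction left over the rolling window that ends on `as_of`."""
    start = as_of - timedelta(days=window_days - 1)
    in_window = [d for d in history if start <= d.day <= as_of]
    total = sum(d.total for d in in_window)
    failed = sum(d.failed for d in in_window)
    allowed = (1.0 - slo_target) * total
    if allowed <= 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    return 1.0 - failed / allowed


def next_freeze_state(frozen: bool, remaining: float,
                      lift_threshold: float = 0.5) -> bool:
    """Freeze at exhaustion, lift only once the threshold is regained."""
    if not frozen:
        return remaining <= 0.0
    return remaining < lift_threshold
```

Using two different thresholds (freeze at 0%, lift at 50%) keeps the freeze from flapping on and off around a single boundary. The window function also shows why recovery is tied to incident age: a large incident keeps counting against the budget until the day it slides out of the window, at which point its share of the budget returns all at once.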
Avoid
The error budget policy fails in two ways: by being ignored, or by being routinely overridden. Both produce the same outcome: the SLO loses meaning, the team stops trusting the dashboard, and reliability becomes a checkbox.
- Avoid ignoring exhausted budgets. The single biggest threat to the policy is leadership saying "we know we are out of budget but this feature is too important to wait." Once that happens even a single time, the policy becomes optional, and an optional policy is no policy at all. Stick to the freeze.
- Erodes SLO meaning. An SLO that has no consequences when missed is not an SLO; it is a wish. Teams stop investing in reliability because the dashboard does not actually drive decisions. The leading indicator (SLO health) decouples from the lagging indicator (customer experience), and customer-facing reliability silently degrades.
- Avoid the "this incident does not count" exception. Excluding specific incidents from the budget calculation ("the dependency outage was not our fault") is a rabbit hole. Either the incident affected customers (in which case it counts) or it did not (in which case it should not be in the metric anyway). Carving out exceptions is how the SLO becomes meaningless.
- Avoid making the trigger negotiable. The trigger fires automatically. Anyone (including leadership) requesting an override is asking the team to abandon the policy. The right answer is "we can renegotiate the target at the next quarterly review, but for now the policy is in effect." Holding that line under pressure is what defines a real reliability practice.
- Avoid silent loosening. Some teams quietly loosen the trigger (from "exhausted" to "30% overspent") to avoid firing the policy. This is the same as ignoring it but with extra steps. Either the policy is honest about when it fires or it does not provide governance.
An error budget policy with a clear trigger, a defined recovery, and a leadership-backed defense against the temptation to ignore it is the single most powerful governance tool a reliability practice has. Nova AI Ops computes the burn rate, watches for the trigger conditions, posts the freeze notice when they fire, tracks recovery progress, and lifts the freeze when the threshold is met, so the policy enforces itself rather than depending on human discipline at the worst moments.