Error Budget Policy Template
A concrete template for an error budget policy.
Trigger
An error budget policy is the document that turns a number on a dashboard into governance. The key piece is the trigger: the specific, measurable condition that switches the team from feature work to reliability work. If the trigger is fuzzy, the policy is fuzzy, and feature pressure wins every time.
The triggers that hold up under real pressure:
- Budget exhaustion: The clearest trigger and the easiest to defend. When the rolling 28-day error budget hits zero, the policy fires. There is no judgment call. The dashboard shows the number, the policy shows the action.
- 25% budget remaining (early warning): Many teams use a softer trigger before exhaustion to avoid the cliff effect. "When 75% of the budget is consumed in the current window, the next deploy must be reliability-focused, not feature-focused." This gives the team time to course-correct before they are out of budget.
- Burn rate threshold: A 14x burn rate sustained for 1 hour is a typical threshold. That hour alone consumes about 2% of a 28-day budget, and at that pace even a full budget is gone in 2 days (28 / 14 = 2). Burn-rate triggers fire on the trajectory, not the absolute number, and catch the cases where you are about to be in trouble even though the dashboard still looks green.
- Multi-incident pattern: Three customer-impacting incidents in a rolling 30-day window, regardless of total minutes burned. This catches the case where lots of small incidents are eroding trust without consuming much budget on paper.
The trigger is the unambiguous signal. Whatever you pick, write it down and commit to it before you are inside the moment. Defining a trigger during an incident does not work.
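The four triggers above can be written down as code so that nobody has to interpret them in the moment. The following is a minimal sketch, assuming a request-based SLO; the `BudgetStatus` fields, the function names, and the way the inputs are collected are hypothetical, while the thresholds (zero budget, 25% remaining, 14x burn, three incidents) are the ones listed above.

```python
from dataclasses import dataclass

WINDOW_DAYS = 28  # rolling budget window used in the text


@dataclass
class BudgetStatus:
    """Hypothetical snapshot of the inputs a trigger check needs."""
    slo_target: float         # e.g. 0.999 for a 99.9% SLO
    total_requests: int       # requests seen in the rolling window
    failed_requests: int      # failed requests in the rolling window
    hourly_error_rate: float  # error ratio over the last hour
    incidents_30d: int        # customer-impacting incidents, rolling 30 days


def budget_remaining(s: BudgetStatus) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - s.slo_target) * s.total_requests
    return 1.0 - s.failed_requests / allowed_failures


def burn_rate(s: BudgetStatus) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return s.hourly_error_rate / (1.0 - s.slo_target)


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is gone if the burn rate holds."""
    return window_days / rate


def budget_consumed(rate: float, hours: float, window_days: int = WINDOW_DAYS) -> float:
    """Fraction of the whole budget consumed by `hours` at the given burn rate."""
    return rate * hours / (window_days * 24)


def fired_triggers(s: BudgetStatus) -> list[str]:
    """Return the names of every trigger condition that currently holds."""
    fired = []
    remaining = budget_remaining(s)
    if remaining <= 0.0:
        fired.append("budget_exhausted")
    elif remaining <= 0.25:
        fired.append("early_warning")
    if burn_rate(s) >= 14.0:
        fired.append("burn_rate")
    if s.incidents_30d >= 3:
        fired.append("multi_incident")
    return fired
```

For example, a 99.9% service with 1,000,000 requests and 800 failures in the window has about 20% of its budget left, so `fired_triggers` reports the early warning. Keeping each condition as a separate named check makes it easy to see in the freeze notice which clause of the policy fired.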
Response
The response to a fired trigger is what makes the policy real. The actions have to be specific enough that nobody can negotiate them down at the moment of pressure, and contained enough that the team can actually execute them.
- Feature freeze on the affected service: No new feature deploys to production until the budget recovers or the team completes the reliability sprint. Bug fixes, security patches, and reliability work continue. This is the load-bearing piece. Without an actual freeze, the policy has no teeth.
- Reliability sprint: The team's next planning cycle is dedicated to closing the contributor gaps the budget burn revealed. Specific items, not "general reliability work." The list comes from the incident retros and the burn-attribution data.
- Heightened deploy gates: Any deploy that does happen during the freeze period requires additional approval (typically the SRE lead or eng lead, not just code review). Deploys of reliability fixes still get scrutiny because reliability fixes also break things.
- Incident retro on every incident, not just severe ones: During a budget-exhausted window, every incident gets a written retro and an action item. This raises the friction on shipping changes that cause incidents and builds the documentation of what is actually consuming the budget.
- Stakeholder notification: Product and leadership are told the freeze is in effect and given an estimated lift date. The notification is a fact, not a request for permission. The policy was agreed to in advance precisely so this conversation does not have to happen during the incident.
The response is the part most teams underspecify. "We'll prioritize reliability" is not a response. "Feature deploys halt, the next sprint is reliability-only, every incident gets a retro" is.
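The freeze and the heightened deploy gate are mechanical enough to express as a single check in a deploy pipeline. The sketch below is one possible shape, not a prescribed implementation; `ChangeKind`, `deploy_decision`, and the `extra_approval` flag are hypothetical names standing in for whatever your pipeline uses to classify changes and record sign-off.

```python
from enum import Enum


class ChangeKind(Enum):
    """Hypothetical classification of a deploy, set by the author or a label."""
    FEATURE = "feature"
    BUG_FIX = "bug_fix"
    SECURITY_PATCH = "security_patch"
    RELIABILITY = "reliability"


def deploy_decision(
    kind: ChangeKind,
    freeze_active: bool,
    extra_approval: bool,
) -> tuple[bool, str]:
    """Decide whether a deploy may proceed, with a reason for the audit trail.

    `extra_approval` means sign-off beyond normal code review
    (for example the SRE lead or eng lead).
    """
    if not freeze_active:
        return True, "no freeze in effect"
    if kind is ChangeKind.FEATURE:
        # Feature deploys halt for the whole freeze; approval does not override this.
        return False, "feature freeze in effect"
    if not extra_approval:
        # Fixes and reliability work continue, but only through the heightened gate.
        return False, "freeze in effect: additional approval required"
    return True, "allowed during freeze with additional approval"
```

Returning a reason string alongside the boolean is a deliberate choice: the rejected deploy carries the policy clause with it, so the conversation is about the policy rather than about the person who blocked the change.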
Recovery
The exit ramp from the policy matters as much as the entry. A policy with no recovery clause turns into permanent feature paralysis or, more commonly, gets quietly violated when the team cannot stand the freeze any more.
- Budget restoration is the lift trigger: When the rolling-window budget returns to a safe threshold (typically 50% remaining), feature deploys resume. The lift is automatic, just like the freeze was. No special meeting required.
- Reliability sprint completion gates the lift on long incidents: If a structural issue caused the burn (a missing test category, an unmonitored dependency, an undersized cluster), the lift waits for the fix to ship, not just for the budget to recover on its own. Otherwise the next quarter starts with the same hidden risk.
- Document what was learned: A short writeup, public to engineering, summarizing what triggered the freeze, what changed during it, and what the team did differently when it lifted. This is how the policy becomes a learning tool, not just a punishment.
- Adjust the SLO or the policy if necessary: If a freeze fires three times in a year, the SLO target may be wrong, or the dependencies may not actually support the commitment, or the operational investment level is too low. The recovery is the right time to renegotiate. Adjust deliberately, not in the heat of the moment.
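The lift conditions can be made as automatic as the freeze. The following is a small sketch under stated assumptions: the function and parameter names are hypothetical, the 50% threshold and the three-freezes-per-year review signal are taken from the bullets above, and how you determine that a structural fix has shipped is left to your own tracking.

```python
def should_lift_freeze(
    budget_remaining: float,
    structural_issue_found: bool,
    structural_fix_shipped: bool,
    lift_threshold: float = 0.50,
) -> bool:
    """Return True when the freeze can lift without a meeting.

    `budget_remaining` is the unspent fraction of the rolling-window budget.
    """
    if budget_remaining < lift_threshold:
        return False
    # A recovered budget is not enough if the root cause is still in place.
    if structural_issue_found and not structural_fix_shipped:
        return False
    return True


def slo_review_due(freezes_last_12_months: int, limit: int = 3) -> bool:
    """Flag that the SLO target or the policy itself should be renegotiated."""
    return freezes_last_12_months >= limit
```

With these defaults, `should_lift_freeze(0.62, True, False)` stays `False` until the structural fix ships, which is the behavior the second bullet asks for.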
An error budget policy with a clear trigger, a specific response, and a defined recovery is one of the highest-leverage governance documents an engineering team can write. Nova AI Ops computes the burn rate, watches for the trigger conditions, posts the freeze notice when they fire, tracks reliability-sprint progress against the contributor list, and lifts the freeze when the recovery thresholds are met, so the policy enforces itself.