By Samson Tanimawo, PhD. Published Mar 31, 2026.

An Error Budget Policy Template That Survives Politics

An error budget without a written policy is a number nobody enforces. A policy without teeth is theatre. Here is the minimum text that turns a budget into governance.

Why a policy at all

A budget is just a number. A policy is the agreement on what happens when the number runs low. Without one, you will negotiate the same conversation every quarter, and the team that argues hardest will win each time.

The structural issue: an error budget without a policy is a number that is nominally important but practically ignorable. Engineers see the budget burning and ask "should we slow down?" Product sees the burn and answers "but we have features to ship." Without a policy that has already decided what to do, the conversation is political: whoever has more leverage wins, and the team's behaviour becomes inconsistent.

A policy converts the political question into a procedural one. "Below 10% remaining we halt feature work" doesn't require a meeting; it's the rule. The policy itself was negotiated once; the situations are decided automatically. That's what makes the budget a real tool rather than a decorative metric.

Four required sections

Every workable policy has four parts: how we measure, what triggers escalation, what triggers a halt on feature work, and the exception clause. Fewer than that and the policy is decorative.

The four-section structure is what makes the policy enforceable. Skip any one and you create ambiguity that the next political conversation exploits. The team that wrote a policy with three sections instead of four discovers six months later that the missing section is exactly where the leverage gets used.

Each section has a specific job. Section 1 prevents arguments about whether the budget was actually exceeded. Section 2 creates progressive consequences before the halt. Section 3 is the teeth. Section 4 is the exit hatch that prevents the policy from being used as a weapon. All four are required; cutting one weakens the policy disproportionately.

Section 1: how we measure

Define the SLO formally. Specify the time window, the SLI definition, and where the data comes from. "We measure availability as the fraction of successful requests over the last 28 days, sourced from the load balancer access logs." That sentence eliminates ten future arguments.

The arguments the formal definition prevents: "But that's not really availability. What about timeouts?" "Should we exclude internal traffic?" "What about requests that returned 4xx because of bad input?" "Why 28 days and not 30?" Each is a legitimate question. Without a written definition, each is unanswerable, and each becomes a 30-minute meeting when the budget runs low.

The data-source clause is critical. "Sourced from the load balancer access logs" is different from "sourced from the application's internal metrics": the two will show different numbers because they measure at different points in the request path. Pick one, commit to it, and document the choice.
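The measurement clause reduces to a small piece of arithmetic once the definition is written down. The sketch below is illustrative only: the function name, the 99.9% target, and the request counts are assumptions, not part of any particular monitoring tool.

```python
def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Return the unspent fraction of the error budget for one window.

    1.0 means nothing spent, 0.0 means fully spent, negative means overspent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical 28-day window: 10M requests, 7,500 failures, 99.9% target.
remaining = budget_remaining(10_000_000, 7_500, 0.999)
print(f"{remaining:.0%} of the budget remains")
```

With these made-up numbers the SLO tolerates about 10,000 failures, 7,500 have been spent, and roughly a quarter of the budget is left. Whatever counts you feed in should come from the single data source the policy names.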

Section 2: escalation gates

At 50% budget remaining, the on-call lead reviews the burn weekly. At 25%, the engineering manager reviews daily and pauses any non-critical changes. At 10%, the VP is informed. These thresholds are political, not technical, and they need names attached.

The escalation creates progressive engagement. Without it, teams either ignore the budget until it's gone (no early warning) or treat every burn as a crisis (alarm fatigue). The 50/25/10 ladder gives leadership multiple opportunities to intervene before the halt; the ladder is what makes the eventual halt politically defensible.

The names matter. "The engineering manager reviews daily" is different from "engineering reviews daily": the named person is accountable. If the policy says "the team reviews," nobody actually reviews; if it names the EM, the EM owns the review. Always name a person or a specific role.
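The ladder is easy to encode so that the owner of each gate is never in doubt. This is a minimal sketch: the `Gate` type and `active_gate` function are hypothetical names, and it assumes "at 50%" means "at or below 50% remaining."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gate:
    threshold: float  # gate applies when remaining budget is at or below this
    owner: str        # the named role accountable at this gate
    action: str


# Ordered from most to least severe so the first match wins.
GATES = (
    Gate(0.10, "VP", "is informed"),
    Gate(0.25, "engineering manager", "reviews daily and pauses non-critical changes"),
    Gate(0.50, "on-call lead", "reviews the burn weekly"),
)


def active_gate(remaining: float) -> Optional[Gate]:
    """Return the most severe gate that applies, or None above 50%."""
    for gate in GATES:
        if remaining <= gate.threshold:
            return gate
    return None


gate = active_gate(0.22)
if gate is not None:
    print(f"The {gate.owner} {gate.action}.")
```

Keeping the owner in the same record as the threshold means nobody can change a number without also stating who answers for it.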

Section 3: when feature work halts

The teeth. Below 10% budget remaining, all non-reliability work pauses until either the budget recovers or the SLO is formally renegotiated. This is the clause leadership will try to soften ("can we just pause a few projects?"). Don't soften it. The whole point of a budget is the trigger.

The reason the trigger must be hard: a soft trigger gets negotiated. "We're at 8% but this feature is really important" becomes the conversation every quarter. With a hard trigger, the conversation becomes "we're at 8% so feature work is paused; here's what we're doing to recover." Different conversation, different outcomes.

What "halt feature work" means in practice. Reliability work continues (fixing the things that consumed the budget). Customer-impacting bug fixes continue. Routine operational work continues. NEW features are paused. The discipline is precise: "non-reliability NEW work" pauses, not "everything."
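The halt rule is precise enough to write as a predicate. The sketch below assumes the four work categories named above and treats "below 10%" as strictly less than 0.10; the enum and function names are hypothetical.

```python
from enum import Enum


class WorkType(Enum):
    RELIABILITY = "reliability work"
    CUSTOMER_BUG_FIX = "customer-impacting bug fix"
    ROUTINE_OPS = "routine operational work"
    NEW_FEATURE = "new feature"


HALT_THRESHOLD = 0.10  # fraction of budget remaining; halt applies strictly below


def may_proceed(work: WorkType, remaining: float) -> bool:
    """Return True if this kind of work may continue at the given budget level."""
    if remaining >= HALT_THRESHOLD:
        return True  # no halt in effect
    # During a halt, only new feature work is paused.
    return work is not WorkType.NEW_FEATURE


for work in WorkType:
    status = "continues" if may_proceed(work, 0.08) else "paused"
    print(f"{work.value}: {status}")
```

There is no "reduced" or "case by case" branch, which is the point: the function returns a yes or a no.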

Section 4: exception clause

The exit hatch. Major external events (security incident, regulatory deadline) can override the halt with a written justification and a recovery plan. Without this clause, the policy gets ignored the first time something genuinely urgent collides with it.

The exception is what makes the policy survive. Without it, the team faces an impossible choice the first time a regulatory deadline lands during a budget halt: violate the policy or miss the deadline. Either choice damages the policy's credibility. The exception clause solves both: the regulatory work proceeds, the policy survives.

The exception's discipline. Written justification (the reason is documented). Recovery plan (how the team will rebuild budget after the exception). Approval from the named role (usually VP of engineering). Without these, the exception becomes a back door that everyone uses.
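The three requirements can be checked mechanically before an override is accepted. This is a sketch under assumptions: the `ExceptionRequest` fields are hypothetical, and the approver is taken to be the VP of engineering, as suggested above.

```python
from dataclasses import dataclass
from typing import List

APPROVER_ROLE = "VP of engineering"  # assumption: the policy names this role


@dataclass
class ExceptionRequest:
    reason: str         # the written justification
    recovery_plan: str  # how the budget will be rebuilt afterwards
    approved_by: str    # role that signed off


def missing_requirements(request: ExceptionRequest) -> List[str]:
    """Return what is still missing; an empty list means the override is valid."""
    missing = []
    if not request.reason.strip():
        missing.append("written justification")
    if not request.recovery_plan.strip():
        missing.append("recovery plan")
    if request.approved_by != APPROVER_ROLE:
        missing.append(f"approval from {APPROVER_ROLE}")
    return missing


request = ExceptionRequest(
    reason="Regulatory deadline falls inside the halt",
    recovery_plan="",
    approved_by="VP of engineering",
)
print(missing_requirements(request))
```

A request with an empty recovery plan is rejected even when the reason is good and the right person approved it. That is what keeps the exit hatch from becoming a back door.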

Signoffs that matter

The VP of engineering and the head of product. If only one of them signs, the other will treat the policy as advisory.

The signoff dynamic. Engineering wants the policy because it gives them the authority to halt feature work. Product wants the policy because it forces engineering to rebuild reliability before adding load. Both signing means both have committed; neither can later claim "I didn't agree to this."

What happens without product's signoff: at the first budget halt, the head of product escalates to the CEO, arguing that engineering is blocking revenue features. Without joint signoff, the engineering VP loses the political argument. With joint signoff, the policy is a company-level commitment that the CEO can't unilaterally override.

Common antipatterns

The "soft halt." Below 10%, feature work "should be reduced." Vague verbs lose to specific deadlines every time. Use "halts" and "pauses," not "reduced" or "considered."

The policy that nobody reviews. Written, signed, filed in a wiki, never read. Every budget conversation re-derives the policy from scratch. Schedule quarterly reviews of the policy itself so it stays in front of the people it binds.

The policy that's actually three policies. Different services, different SLOs, different policies. Sometimes necessary, but each additional policy multiplies the political surface. Most teams should have one error-budget policy with per-service SLO targets, not 12 separate policies.
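One policy with per-service targets can be as simple as a shared rule set plus a lookup table. The sketch below is illustrative: the service names and targets are invented, and the thresholds are the ones used earlier in this article.

```python
# Shared policy: one set of rules, negotiated and signed once.
POLICY = {
    "window_days": 28,
    "escalation_thresholds": (0.50, 0.25, 0.10),
    "halt_below": 0.10,
}

# Per-service SLO targets (hypothetical services and numbers).
SLO_TARGETS = {
    "checkout": 0.999,
    "search": 0.995,
}


def service_policy(service: str) -> dict:
    """Combine the shared rules with one service's SLO target."""
    return {**POLICY, "service": service, "slo_target": SLO_TARGETS[service]}


print(service_policy("search"))
```

Only the target varies per service. The window, the gates, and the halt line stay identical, so there is one document to argue about instead of twelve.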

The exception clause without rigor. "We can override the halt for important work." Anyone, anytime. The exception ends up eating the policy. Always require a written justification, a recovery plan, and a named approver.

What to do this week

Three moves. (1) Draft the four sections for your most important service. Don't try to perfect it; first draft in a half-day, polish in review. (2) Get the VP of engineering and head of product on a 30-minute call. Walk them through the policy. Surface their concerns; revise. (3) Sign and publish. Most policies die in revision rounds; ship the v1, iterate based on actual use rather than imagined cases.