Test Flakiness Budget

A cap on flaky tests that forces cleanup.

What a flake budget is

The flake budget is the discipline of capping how flaky CI is allowed to get. Above the cap, new test additions stop until cleanup; that discipline is what prevents flakes from compounding into a culture where everyone silently re-runs failures by default.

Maximum acceptable flake percent. Explicit org-level cap. Above it, no new tests merge until cleanup brings the team back inside the budget.
Typical budget: one percent. A common bar. At most one percent of CI runs hit a flake; below that, devs still trust failures, above it they start ignoring them.
Forces ownership. "Add the flake, fix the flake" rule. Avoids the tragedy-of-the-commons pattern where everyone benefits from the test suite and nobody maintains it.
Named budget owner. One responsible engineer per team. Catches the "everyone-and-no-one" flake-fixing pattern that shows up under sustained schedule pressure.
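
The cap and the merge rule above can be sketched as a small gate. This is a minimal illustration, not a reference implementation: the names FlakeBudget, flake_rate, and new_tests_allowed are hypothetical, and the flake rate is assumed to be flaky CI runs divided by total CI runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlakeBudget:
    """Hypothetical record of an org-level flake budget."""
    max_flake_rate: float  # e.g. 0.01 for the one-percent bar
    owner: str             # the named budget owner


def flake_rate(flaky_runs: int, total_runs: int) -> float:
    """Fraction of CI runs that hit a flake (assumed definition)."""
    if total_runs == 0:
        return 0.0
    return flaky_runs / total_runs


def new_tests_allowed(budget: FlakeBudget, flaky_runs: int, total_runs: int) -> bool:
    """New tests may merge only while the rate is at or below the cap."""
    return flake_rate(flaky_runs, total_runs) <= budget.max_flake_rate
```

With a one-percent budget, 8 flaky runs out of 1000 keeps the gate open, while 15 out of 1000 closes it until cleanup.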

Measuring flakes

Measurement starts with re-runs on the same SHA. Pass-on-retry is the flake signal; per-suite tracking surfaces the worst offenders so cleanup time targets the highest-leverage fixes first.

Re-run on same SHA. Retry every failure once. A pass on the retry against the same commit marks the run as a flake; deterministic failures stay red.
Per-suite flake rate. Tracked rate per suite. Browser and integration suites are inherently flakier than unit; comparing them against a single bar misleads.
Tools. BuildPulse, Trunk.io, and GitHub's flaky-test detection. Standard set; pick whichever integrates with the existing CI without bolt-on glue.
Published flake dashboard. Visible chart per team. Accountability follows visibility; private metrics get gamed or ignored.
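
The retry rule and per-suite tracking above could look like the following sketch. The RunRecord shape and its field names are assumptions made for illustration; a real CI system or one of the tools listed will expose its own schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of one CI run of one suite."""
    suite: str
    sha: str
    first_passed: bool
    retry_passed: Optional[bool] = None  # None when no retry was needed


def classify(run: RunRecord) -> str:
    """Pass-on-retry against the same SHA is the flake signal."""
    if run.first_passed:
        return "pass"
    if run.retry_passed:
        return "flake"
    return "fail"  # deterministic failure, stays red


def per_suite_flake_rate(runs: list[RunRecord]) -> dict[str, float]:
    """Flaky runs divided by total runs, tracked separately per suite."""
    totals: dict[str, int] = defaultdict(int)
    flakes: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run.suite] += 1
        if classify(run) == "flake":
            flakes[run.suite] += 1
    return {suite: flakes[suite] / totals[suite] for suite in totals}
```

Keeping the rates keyed by suite makes it possible to compare each suite against its own bar rather than a single shared one.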

When you blow the budget

The over-budget response is automatic and pre-agreed. Stop new tests, quarantine the worst offenders, and allocate explicit cleanup time so the recovery does not depend on a project manager remembering to schedule it.

Halt new test additions. New-test-merge block. Existing tests continue to run; new ones wait until the flake rate drops back inside the budget.
Quarantine offenders. Move the worst flakers to a non-blocking suite. They keep running for signal; they no longer break the merge gate.
Allocate engineering time. Explicit quarterly flake-fix allocation. Flake fixes do not ship features and will lose every prioritisation argument unless the time is reserved upfront.
Visible over-budget banner. Team-level status indicator. Catches the "we are over but no one noticed" failure mode that turns a budget into theatre.
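
One way to choose which offenders to quarantine is a greedy pass over the worst flakers, sketched below. The function name pick_quarantine is hypothetical, and the sketch makes a simplifying assumption that each flaky run is caused by exactly one test, so removing a test removes its flaky runs from the total.

```python
def pick_quarantine(
    flakes_by_test: dict[str, int], total_runs: int, budget: float
) -> list[str]:
    """Quarantine the worst flakers until remaining flaky runs fit the budget.

    Assumes one offending test per flaky run (a simplification).
    """
    allowed = budget * total_runs
    remaining = sum(flakes_by_test.values())
    quarantined: list[str] = []
    # Walk the tests from most to least flaky.
    worst_first = sorted(flakes_by_test.items(), key=lambda kv: kv[1], reverse=True)
    for test, count in worst_first:
        if remaining <= allowed:
            break
        quarantined.append(test)  # moves to the non-blocking suite
        remaining -= count
    return quarantined
```

Quarantined tests would still run in the non-blocking suite for signal; the list only decides which ones stop gating merges.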

Preventing new flakes

Prevention beats cleanup. Code-review checks, pre-merge multi-run, and a dedicated test-quality reviewer for tier-one services keep the flake-add rate below the cleanup rate.

Code review checks. Check every test PR for explicit synchronisation, no fixed sleeps, and deterministic test data. These are standard checklist items rather than things discovered ad hoc during review.
Pre-merge multi-run. Run new tests ten times in CI before merge. Surfaces flakes before they land, after which rolling them back is expensive.
Dedicated test-quality reviewer. Named reviewer per tier-one service. Catches subtle test-quality issues that a feature-focused review misses.
Documented timing assumptions. A short "what this test depends on" note per test. Future debugging becomes possible without re-deriving the assumptions from scratch.
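
The pre-merge multi-run above can be sketched as a loop around the new test. This is an illustration only: multi_run and safe_to_merge are hypothetical helpers, and a real pipeline would more likely repeat the test through its runner or CI configuration than through a Python callable.

```python
from typing import Callable


def multi_run(test: Callable[[], None], runs: int = 10) -> int:
    """Run a test repeatedly and return how many runs failed."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except Exception:
            # Any exception, including AssertionError, counts as a failed run.
            failures += 1
    return failures


def safe_to_merge(test: Callable[[], None], runs: int = 10) -> bool:
    """A new test is merge-ready only if every repeated run passed."""
    return multi_run(test, runs) == 0
```

A test that fails even once in the ten runs is treated as flaky or broken and is held back before it lands.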

How to set the budget

Set the budget from the current baseline. Make it tighter for unit tests and looser for end-to-end, and publish it weekly so the bar drifts down rather than the metric drifting up.

Baseline from current rate. Set the starting budget to halve the current rate over six months. Realistic enough that the team can hit it; aggressive enough to drive change.
Tighter for unit tests. Under 0.1 percent target per suite. Achievable for unit tests; teams that miss this bar usually have hidden infrastructure problems.
Looser for end-to-end. One to five percent acceptable per suite. Browser and integration are inherently flakier; the budget acknowledges reality.
Publish weekly. Budget-versus-current chart shared in the team channel each week. Visibility drives the cleanup; private metrics rarely move.
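
The halving target and the weekly chart above could be backed by something like this sketch. The linear glide path is an assumption (the halving could equally be front-loaded or stepped), and glide_path and weekly_status are hypothetical names.

```python
def glide_path(baseline: float, months: int = 6) -> list[float]:
    """Monthly budget targets stepping linearly from baseline down to half of it.

    The linear shape is an assumption made for this sketch.
    """
    target = baseline / 2
    step = (baseline - target) / months
    return [baseline - step * m for m in range(1, months + 1)]


def weekly_status(current: float, budget: float) -> str:
    """One line for the weekly budget-versus-current post."""
    state = "OVER" if current > budget else "within"
    return f"flake rate {current:.2%} vs budget {budget:.2%}: {state}"
```

For example, a suite starting at a four percent flake rate would get six monthly targets ending at two percent, and each week's post would report the current rate against that month's target.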