Test Flakiness Budget
A cap on flaky tests that forces fixes.
What a flake budget is
A flake budget caps how flaky CI is allowed to get. Above the cap, new test additions stop until cleanup brings the rate back down; the cap is what prevents flakes from compounding into a culture where everyone silently re-runs failures by default.
- Maximum acceptable flake percent. Explicit org-level cap. Above it, no new tests merge until cleanup brings the team back inside the budget.
- Typical budget: one percent. A commonly used bar. At most one percent of CI runs experience a flake; below that, devs still trust failures, above it they start ignoring them.
- Forces ownership. "Add the flake, fix the flake" rule. Avoids the tragedy-of-the-commons pattern where everyone benefits from the test suite and nobody maintains it.
- Named budget owner. One responsible engineer per team. Catches the "everyone-and-no-one" flake-fixing pattern that shows up under sustained schedule pressure.
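The cap described above can be sketched as a small merge-gate check. This is a minimal illustration, not a prescribed implementation: `CiWindow`, `flake_rate`, and `new_tests_allowed` are hypothetical names, and the sketch assumes the flake rate is counted as flaky runs divided by total CI runs over some reporting window.

```python
from dataclasses import dataclass

FLAKE_BUDGET = 0.01  # one percent, the typical budget named above


@dataclass(frozen=True)
class CiWindow:
    """Counts for one reporting window (hypothetical shape)."""
    total_runs: int
    flaky_runs: int  # runs that failed, then passed on a retry of the same SHA


def flake_rate(window: CiWindow) -> float:
    """Fraction of CI runs in the window that experienced a flake."""
    if window.total_runs == 0:
        return 0.0  # no runs yet, so nothing has flaked
    return window.flaky_runs / window.total_runs


def new_tests_allowed(window: CiWindow, budget: float = FLAKE_BUDGET) -> bool:
    """New tests may merge only while the rate is inside the budget."""
    return flake_rate(window) <= budget
```

A CI job could call `new_tests_allowed` on the last week of runs and fail any PR that adds test files while it returns `False`; how "adds test files" is detected is left to the team's tooling.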
Measuring flakes
Measurement starts with re-runs on the same SHA. Pass-on-retry is the flake signal; per-suite tracking surfaces the worst offenders so cleanup time targets the highest-leverage fixes first.
- Re-run on same SHA. Retry every failure once. A pass on the retry against the same commit marks the run as a flake; deterministic failures stay red.
- Per-suite flake rate. Tracked rate per suite. Browser and integration suites are inherently flakier than unit suites; comparing them all against a single bar misleads.
- Tools. BuildPulse, Trunk.io, and GitHub's flaky-test detection. Standard set; pick whichever integrates with the existing CI without bolt-on glue.
- Published flake dashboard. Visible chart per team. Accountability follows visibility; private metrics get gamed or ignored.
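The pass-on-retry signal and the per-suite rate can be sketched as follows. The `Attempt` record and both function names are assumptions made for illustration; real CI systems and the tools listed above expose this data in their own formats.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class Attempt(NamedTuple):
    """One CI run of one suite against one commit (hypothetical record)."""
    suite: str
    sha: str
    first_passed: bool
    retry_passed: Optional[bool]  # None when no retry was needed


def classify(attempt: Attempt) -> str:
    """Label a run as 'pass', 'flake' (failed, then passed on retry), or 'fail'."""
    if attempt.first_passed:
        return "pass"
    if attempt.retry_passed:
        return "flake"
    return "fail"  # deterministic failure: red on both tries


def per_suite_flake_rate(attempts: Iterable[Attempt]) -> Dict[str, float]:
    """Flaky runs divided by total runs, computed separately per suite."""
    totals: Dict[str, int] = defaultdict(int)
    flakes: Dict[str, int] = defaultdict(int)
    for attempt in attempts:
        totals[attempt.suite] += 1
        if classify(attempt) == "flake":
            flakes[attempt.suite] += 1
    return {suite: flakes[suite] / totals[suite] for suite in totals}
```

Keeping the rate per suite, rather than one global number, is what lets a dashboard show unit and browser suites against their own bars.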
When you blow the budget
The over-budget response is automatic and pre-agreed. Stop new tests, quarantine the worst offenders, and allocate explicit cleanup time so the recovery does not depend on a project manager remembering to schedule it.
- Halt new test additions. New-test-merge block. Existing tests continue to run; new ones wait until the flake rate drops back inside the budget.
- Quarantine offenders. Move the worst flakers to a non-blocking suite. They keep running for signal; they no longer break the merge gate.
- Allocate engineering time. Explicit quarterly flake-fix allocation. Flake fixes do not ship features and will lose every prioritisation argument unless the time is reserved upfront.
- Visible over-budget banner. Team-level status indicator. Catches the "we are over but no one noticed" failure mode that turns a budget into theatre.
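One way to choose which offenders to quarantine is to move the worst flakers out greedily until the remaining rate fits the budget. The sketch below assumes each flaky run is attributed to exactly one test, which is a simplification; `pick_quarantine` and the test names in the usage note are hypothetical.

```python
from typing import Dict, List


def pick_quarantine(
    flakes_by_test: Dict[str, int], total_runs: int, budget: float
) -> List[str]:
    """Return the worst flakers to move to a non-blocking suite.

    Tests are taken in descending order of flake count until the
    remaining flaky runs, divided by total runs, fit inside the budget.
    """
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    remaining = sum(flakes_by_test.values())
    quarantined: List[str] = []
    worst_first = sorted(flakes_by_test.items(), key=lambda kv: kv[1], reverse=True)
    for test, count in worst_first:
        if remaining / total_runs <= budget:
            break  # back inside the budget, stop quarantining
        quarantined.append(test)
        remaining -= count
    return quarantined
```

For example, with 2000 runs, a one percent budget, and flake counts of 30, 12, and 3 for `test_checkout`, `test_login`, and `test_search`, only `test_checkout` is quarantined: removing its 30 flakes leaves 15 of 2000, which is inside the budget.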
Preventing new flakes
Prevention beats cleanup. Code-review checks, pre-merge multi-run, and a dedicated test-quality reviewer for tier-one services keep the flake-add rate below the cleanup rate.
- Code review checks. Explicit synchronisation, no sleeps, and deterministic test data on every test PR. These are standard checklist items, not things to discover ad hoc during review.
- Pre-merge multi-run. Run new tests ten times in CI before merge. Surfaces flakes before they land, after which rolling them back is expensive.
- Dedicated test-quality reviewer. Named reviewer per tier-one service. Catches subtle test-quality issues that a feature-focused review misses.
- Documented timing assumptions. A short "what this test depends on" note per test. Future debugging becomes possible without re-deriving the assumptions from scratch.
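The pre-merge multi-run can be sketched as a loop that counts failures, alongside the arithmetic for how likely ten runs are to expose a flake. Both function names are hypothetical, and the probability formula assumes each run fails independently with the same rate, which real flakes only approximate.

```python
from typing import Callable


def count_failures(test_fn: Callable[[], None], runs: int = 10) -> int:
    """Run a test callable repeatedly and count how many runs raise."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except Exception:  # any exception counts as a failed run
            failures += 1
    return failures


def detection_probability(flake_rate: float, runs: int = 10) -> float:
    """Chance that at least one of `runs` independent runs fails.

    Computed as 1 - (1 - p)^n for per-run failure probability p.
    """
    if not 0.0 <= flake_rate <= 1.0:
        raise ValueError("flake_rate must be between 0 and 1")
    return 1.0 - (1.0 - flake_rate) ** runs
```

Under the independence assumption, a test that flakes 20 percent of the time is caught by ten runs about 89 percent of the time, while a 5 percent flake is caught only about 40 percent of the time. The multi-run is therefore a filter for the worst flakes, not a guarantee, which is why the review checklist still matters.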
How to set the budget
Set the budget from the current baseline. Tighter for unit tests, looser for end-to-end, published weekly so the bar drifts down rather than the metric drifting up.
- Baseline from current rate. Set the starting budget to half the current rate, to be reached within six months. Realistic enough that the team can hit it; aggressive enough to drive change.
- Tighter for unit tests. Under 0.1 percent target per suite. Achievable for unit tests; teams that miss this bar usually have hidden infrastructure problems.
- Looser for end-to-end. One to five percent acceptable per suite. Browser and integration suites are inherently flakier; the budget acknowledges reality.
- Publish weekly. Budget-versus-current chart shared in the team channel each week. Visibility drives the cleanup; private metrics rarely move.
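The budget-setting rules above can be sketched as a monthly glide path plus a per-suite check for the weekly chart. The linear glide path is an assumption (the text only fixes the start and end points), and `SUITE_BUDGETS`, `glide_path`, and `over_budget` are hypothetical names; the end-to-end ceiling uses the upper end of the one to five percent range.

```python
from typing import Dict, List

# Per-suite ceilings taken from the targets above.
SUITE_BUDGETS: Dict[str, float] = {
    "unit": 0.001,  # under 0.1 percent
    "e2e": 0.05,    # upper end of the one to five percent range
}


def glide_path(baseline: float, months: int = 6) -> List[float]:
    """Monthly budget targets that fall linearly from baseline to half of it."""
    if months <= 0:
        raise ValueError("months must be positive")
    target = baseline / 2
    step = (baseline - target) / months
    return [baseline - step * month for month in range(1, months + 1)]


def over_budget(current_rates: Dict[str, float]) -> List[str]:
    """Suites whose current flake rate exceeds their ceiling."""
    return sorted(
        suite
        for suite, rate in current_rates.items()
        if suite in SUITE_BUDGETS and rate > SUITE_BUDGETS[suite]
    )
```

A team starting at a four percent flake rate would get six monthly targets ending at two percent; the weekly chart then plots the current rate against that month's target.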