Flaky Test Discipline
Flaky tests erode trust in CI. This is the discipline for detecting, quarantining, and fixing them.
The cost of flaky tests
Flakes destroy CI trust faster than anything else. Above 1% flake rate, "is it broken or flaky?" becomes the default question, retry-until-green becomes the default behaviour, and real bugs slip through because nobody trusts a red CI.
- Flakes destroy CI trust. Teams fall into retry-until-green; real bugs slip through and broken builds get ignored.
- 1% threshold. This is the trust-collapse bar for a CI system; above it, "is it broken or flaky?" becomes the default question.
- Not bad luck. Every flake has a real timing or environment dependency behind it; flakes are signal about real concurrency or environment problems, not noise.
- Published trust-cost narrative. A documented "why flakes matter" reference for the team supports prioritisation when flake fixes compete with feature work.
Detection
Detection is mechanical. Track each test's pass rate over its last 100 runs, use tooling to flag flakes automatically, and rerun on the same SHA when in doubt.
- Per-test pass rate over 100 runs. Keep a rolling pass rate for every test; anything below 99% is suspect.
- Tools. BuildPulse, Trunk.io, and RunForCover integrate with CI to flag flakes automatically.
- Manual confirmation. Rerun the failing test 10 times on the same SHA; if at least 2 runs fail while others pass, it is flaky. If all 10 fail, the test is broken rather than flaky.
- Published flake list. A visible top-flakers chart for the team supports accountability and pulls fixes into priority.
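The rolling pass rate and the same-SHA retest can be sketched in a few lines of Python. This is a minimal illustration, not the behaviour of any tool named above: `PassRateTracker`, `classify_same_sha_rerun`, and the "inconclusive" label for a single failure are hypothetical names and choices introduced here; only the 100-run window, the 99% bar, and the 2-failure rule come from the text.

```python
from collections import defaultdict, deque

WINDOW = 100          # runs kept per test, per the text
SUSPECT_BELOW = 0.99  # a pass rate under this marks a test as suspect


class PassRateTracker:
    """Keeps the most recent `window` outcomes for each test (hypothetical helper)."""

    def __init__(self, window: int = WINDOW) -> None:
        # deque(maxlen=...) silently drops the oldest outcome once full.
        self._runs = defaultdict(lambda: deque(maxlen=window))

    def record(self, test_id: str, passed: bool) -> None:
        self._runs[test_id].append(passed)

    def pass_rate(self, test_id: str) -> float:
        runs = self._runs.get(test_id)
        if not runs:
            return 1.0  # no data yet: do not flag the test
        return sum(runs) / len(runs)

    def suspects(self, threshold: float = SUSPECT_BELOW) -> list[str]:
        return sorted(t for t in self._runs if self.pass_rate(t) < threshold)


def classify_same_sha_rerun(results: list[bool], min_failures: int = 2) -> str:
    """Classify a same-SHA retest as 'passing', 'broken', 'flaky', or 'inconclusive'."""
    failures = results.count(False)
    if failures == 0:
        return "passing"
    if failures == len(results):
        return "broken"  # deterministic failure, not a flake
    # Mixed outcomes on identical code: flaky once enough runs have failed.
    return "flaky" if failures >= min_failures else "inconclusive"
```

Feeding the tracker from CI results and publishing `suspects()` is one way to produce the top-flakers list; the classifier is meant for the 10-run manual check.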
Quarantine flakes fast
Quarantine within 24 hours of confirmation. Quarantined tests get a deadline to fix or delete, and the quarantine list itself is bounded so it does not turn into a graveyard.
- Within 24 hours. Isolate each confirmed flake fast: move it to a non-blocking suite so the team is not blocked on a known-flaky test.
- Deadline: fix or delete in 2 weeks. An explicit clock on every quarantined test avoids graveyard accumulation.
- Bound the quarantine list. Target under 5% of tests in quarantine per team; above that is a sign the team is giving up rather than fixing.
- Named owner per quarantined test. A responsible engineer for each test catches "we forgot about it" before the deadline passes.
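The quarantine rules above map onto a small record per test plus two checks. The sketch below assumes the 5% bound is measured as quarantined tests divided by total tests in the suite, which the text does not state explicitly; `QuarantineEntry`, `overdue`, and `quarantine_over_budget` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FIX_OR_DELETE_WINDOW = timedelta(weeks=2)  # deadline from the text
MAX_QUARANTINED_SHARE = 0.05               # assumption: share of the whole suite


@dataclass(frozen=True)
class QuarantineEntry:
    test_id: str
    owner: str            # the named engineer responsible for this test
    quarantined_on: date

    @property
    def deadline(self) -> date:
        return self.quarantined_on + FIX_OR_DELETE_WINDOW

    def is_overdue(self, today: date) -> bool:
        return today > self.deadline


def overdue(entries: list[QuarantineEntry], today: date) -> list[QuarantineEntry]:
    """Entries past their fix-or-delete deadline."""
    return [e for e in entries if e.is_overdue(today)]


def quarantine_over_budget(
    entries: list[QuarantineEntry],
    suite_size: int,
    max_share: float = MAX_QUARANTINED_SHARE,
) -> bool:
    """True when the quarantine list exceeds the allowed share of the suite."""
    return suite_size > 0 and len(entries) / suite_size > max_share
```

Running `overdue` daily and notifying each entry's `owner` is one way to keep the two-week clock from being forgotten.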
How to fix flakes
Most flakes are timing or shared-state. Fix with explicit synchronisation, isolated fixtures, and deterministic IDs. When the test surfaces a real production race, fix the bug rather than the test.
- Most flakes are timing. Implicit waits, race conditions, and shared state are the common causes across most test suites.
- Fix patterns. Explicit synchronisation, isolated test fixtures, and deterministic IDs are the standard fixes for the standard causes.
- Real production race: fix the bug. Sometimes the test is correct; when a flake surfaces a real concurrency issue, fix the bug, not the test.
- Captured root cause per fix. A documented "why this flaked" note for each fix supports later pattern recognition.
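The three fix patterns can be illustrated with the Python standard library alone. This is a sketch under assumptions: `start_worker` is a hypothetical stand-in for the code under test, and `make_id_factory` is an illustrative helper, not part of any test framework.

```python
import itertools
import tempfile
import threading
from pathlib import Path


def start_worker(ready: threading.Event, out: list) -> threading.Thread:
    """Hypothetical background job standing in for the code under test."""
    def run() -> None:
        out.append("done")
        ready.set()  # signal completion explicitly
    thread = threading.Thread(target=run)
    thread.start()
    return thread


def test_worker_finishes() -> None:
    ready = threading.Event()
    out: list = []
    thread = start_worker(ready, out)
    # Explicit synchronisation: wait on the signal instead of time.sleep(0.1).
    # The timeout only bounds a hang; it is not the expected wait.
    assert ready.wait(timeout=5.0), "worker never signalled completion"
    thread.join()
    assert out == ["done"]


def make_id_factory(prefix: str = "user"):
    """Deterministic IDs: a per-test counter instead of uuid4() or timestamps."""
    counter = itertools.count(1)
    return lambda: f"{prefix}-{next(counter)}"


def test_isolated_fixture() -> None:
    next_id = make_id_factory()
    # Isolated fixture: each test gets its own directory, removed on exit,
    # so no state leaks between tests or between parallel runs.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"{next_id()}.txt"
        path.write_text("hello")
        assert path.read_text() == "hello"
        assert path.name == "user-1.txt"
```

Waiting on an event makes the test as fast as the worker and independent of machine load, which a fixed sleep cannot guarantee in either direction.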
How to install the discipline
Three pieces install the discipline: track flake rate as a CI health metric, block merges when the flake threshold is exceeded, and allocate explicit engineering time to flake fixes.
- Track flake rate as a CI health metric. A flake-rate chart published every week shifts behaviour; visibility is the cheap part.
- Block merges over threshold. A flake-rate gate on each suite blocks merges until the suite is cleaned up; the constraint produces the engineering time.
- Allocate engineering time. Set an explicit flake-fix allocation each quarter; flake fixes do not ship features, so they need explicit prioritisation.
- Named owner per team. A responsible engineer on each team prevents the "everyone-and-no-one" pattern in which nobody actually fixes flakes.
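A merge gate of the kind described above can be as simple as one comparison in a CI step. The sketch below assumes flake rate is defined as flaky failures (failures that passed on a same-SHA retry) divided by total runs, and reuses the 1% bar from earlier as the threshold; `SuiteStats` and `merge_gate` are hypothetical names, and a real team may define the rate differently.

```python
from dataclasses import dataclass

FLAKE_RATE_THRESHOLD = 0.01  # the 1% bar from the text


@dataclass(frozen=True)
class SuiteStats:
    total_runs: int
    flaky_failures: int  # assumption: failures that passed on a same-SHA retry

    @property
    def flake_rate(self) -> float:
        return self.flaky_failures / self.total_runs if self.total_runs else 0.0


def merge_gate(
    stats: SuiteStats, threshold: float = FLAKE_RATE_THRESHOLD
) -> tuple[bool, str]:
    """Return (allowed, reason); merges are blocked while the rate exceeds the threshold."""
    rate = stats.flake_rate
    if rate > threshold:
        return False, f"blocked: flake rate {rate:.2%} exceeds {threshold:.2%}"
    return True, f"ok: flake rate {rate:.2%} within {threshold:.2%}"
```

Emitting the same `flake_rate` value to the weekly chart keeps the gate and the published metric consistent with each other.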