Flaky Test Discipline

Flaky tests erode trust in CI. This is the discipline for detecting, quarantining, and fixing them.

The cost of flaky tests

Flakes destroy CI trust faster than anything else. Above 1% flake rate, "is it broken or flaky?" becomes the default question, retry-until-green becomes the default behaviour, and real bugs slip through because nobody trusts a red CI.

Flakes destroy CI trust. Teams fall into retry-until-green; real bugs slip through and broken builds get ignored.
1% threshold. Above a 1% flake rate, CI trust collapses and "is it broken or flaky?" becomes the default question.
Not bad luck. Every flake depends on timing or environment; flakes are signal about real concurrency or environment problems, not noise.
Published trust-cost narrative. A documented "why flakes matter" reference per team supports prioritisation when flake fixes compete with feature work.

Detection

Detection is mechanical: track per-test pass rate over 100 runs, use tooling to flag flakes automatically, and manually retest on the same SHA when in doubt.

Per-test pass rate over 100 runs. Track a rolling pass rate for every test; anything below 99% is suspect.
Tools. BuildPulse, Trunk.io, and RunForCover integrate with CI to flag flakes automatically.
Manual confirmation. Rerun each suspect failure 10 times on the same SHA; if 2 of those runs fail, it's flaky.
Published flake list. A visible top-flakers chart per team supports accountability and pulls fixes into priority.
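
The tracking and confirmation rules above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's API: `PassRateTracker` and `confirm_flaky` are hypothetical names, and treating an all-fail retest as "broken" rather than "flaky" is an assumption added here.

```python
from collections import deque

WINDOW = 100          # runs tracked per test, per the text
SUSPECT_BELOW = 0.99  # a pass rate under this is suspect


class PassRateTracker:
    """Rolling pass/fail history per test (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.history = {}  # test name -> deque of bools, oldest dropped first

    def record(self, test, passed):
        runs = self.history.setdefault(test, deque(maxlen=self.window))
        runs.append(bool(passed))

    def pass_rate(self, test):
        runs = self.history[test]
        return sum(runs) / len(runs)

    def suspects(self):
        """Tests with a full window and a pass rate below the bar, worst first."""
        flagged = [
            (name, self.pass_rate(name))
            for name, runs in self.history.items()
            if len(runs) == self.window and self.pass_rate(name) < SUSPECT_BELOW
        ]
        return sorted(flagged, key=lambda item: item[1])


def confirm_flaky(same_sha_results, min_failures=2):
    """Same-SHA retest: flaky if at least min_failures runs fail.

    Assumption: if every run fails, the test is broken, not flaky.
    """
    failures = same_sha_results.count(False)
    return min_failures <= failures < len(same_sha_results)
```

Feeding `suspects()` into the published top-flakers chart keeps the list and the detection rule in sync.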

Quarantine flakes fast

Quarantine within 24 hours of confirmation. Quarantined tests get a deadline to fix or delete, and the quarantine list itself is bounded so it does not accumulate into a graveyard.

Within 24 hours. Isolate each confirmed flake fast; move it to a non-blocking suite so the team is not blocked on a known-flaky test.
Deadline: fix or delete in 2 weeks. An explicit clock on every quarantined test avoids graveyard accumulation.
Bound the quarantine list. Keep the list under 5% per team; above that is a sign the team is giving up rather than fixing.
Named owner per quarantined test. A responsible engineer for each test catches "we forgot about it" within the deadline window.
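
One way to make the deadline, owner, and bound checkable is a small quarantine registry. The sketch below is illustrative only: `QuarantineEntry`, `overdue`, and `list_over_budget` are hypothetical names, and reading the 5% bound as a share of the total test count is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FIX_WINDOW = timedelta(weeks=2)   # fix-or-delete deadline from the text
MAX_QUARANTINED_SHARE = 0.05      # assumption: 5% of all tests in the suite


@dataclass(frozen=True)
class QuarantineEntry:
    test: str
    owner: str            # named engineer responsible for the fix
    quarantined_on: date

    @property
    def deadline(self):
        return self.quarantined_on + FIX_WINDOW


def overdue(entries, today):
    """Entries past their fix-or-delete deadline."""
    return [entry for entry in entries if today > entry.deadline]


def list_over_budget(entries, total_tests):
    """True when the quarantine list exceeds its bound."""
    return len(entries) / total_tests > MAX_QUARANTINED_SHARE
```

Running `overdue` in a daily job and pinging each entry's `owner` is one cheap way to keep the clock visible.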

How to fix flakes

Most flakes come from timing or shared state. Fix them with explicit synchronisation, isolated fixtures, and deterministic IDs. When the test surfaces a real production race, fix the bug rather than the test.

Most flakes are timing. Implicit waits, race conditions, and shared state are the common causes across most test suites.
Fix patterns. Explicit synchronisation, isolated test fixtures, and deterministic IDs are the standard fixes for the standard causes.
Real production race: fix the bug. When a flake surfaces a real concurrency issue, the test is correct; fix the bug, not the test.
Captured root cause per fix. A documented "why this flaked" note for each flake supports later pattern recognition.
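
Two of the fix patterns above, explicit synchronisation and deterministic IDs, might look like this in Python. `Worker` is a toy stand-in for the code under test and `make_id_factory` is a hypothetical helper; neither comes from a real library.

```python
import itertools
import threading


class Worker:
    """Toy background job standing in for the code under test."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self.result = 42
        self.done.set()  # signal completion explicitly


def test_worker_finishes():
    worker = Worker()
    worker.start()
    # Flaky version: time.sleep(0.1) and hope the thread has finished.
    # Explicit version: block on the completion signal, with a generous timeout.
    assert worker.done.wait(timeout=5), "worker did not finish within 5s"
    assert worker.result == 42


def make_id_factory(prefix):
    """Deterministic IDs: a fresh counter per test instead of random or clock-based values."""
    counter = itertools.count(1)
    return lambda: f"{prefix}-{next(counter)}"
```

Creating a new factory inside each test also keeps fixtures isolated, since no counter state is shared between tests.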

How to install the discipline

Three pieces install the discipline: track flake rate as a CI health metric, block merges when the flake threshold is exceeded, and allocate explicit engineering time to flake fixes.

Track flake rate as a CI health metric. A flake-rate chart published every week shifts behaviour; visibility is the cheap part.
Block merges over threshold. A flake-rate gate on each suite blocks new tests until cleanup; the constraint produces the engineering time.
Allocate engineering time. Reserve explicit flake-fix time each quarter; flake fixes do not ship features, so they need explicit prioritisation.
Named owner per team. A responsible engineer on each team prevents the "everyone-and-no-one" pattern in which flake fixing belongs to nobody.
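
The merge gate can be sketched as a small check run in CI. Everything here is illustrative: `SuiteRun`, `flake_rate`, and `gate` are hypothetical names, the 1% threshold is borrowed from the trust-collapse bar earlier in this document, and defining a flaky run as "failed first, then passed on retry" is an assumption.

```python
from dataclasses import dataclass

FLAKE_THRESHOLD = 0.01  # assumption: reuse the 1% trust-collapse bar


@dataclass(frozen=True)
class SuiteRun:
    failed_first: bool     # first attempt went red
    passed_on_retry: bool  # same SHA went green on retry


def flake_rate(runs):
    """Share of suite runs that failed and then passed on retry."""
    if not runs:
        return 0.0
    flaky = sum(1 for run in runs if run.failed_first and run.passed_on_retry)
    return flaky / len(runs)


def gate(runs, adds_new_tests):
    """Return (allowed, reason). Only changes that add tests are blocked."""
    rate = flake_rate(runs)
    if adds_new_tests and rate > FLAKE_THRESHOLD:
        return False, (
            f"flake rate {rate:.1%} exceeds {FLAKE_THRESHOLD:.0%}; "
            "fix flakes before adding tests"
        )
    return True, f"flake rate {rate:.1%}"
```

The same `flake_rate` value can feed the weekly chart, so the published metric and the gate never disagree.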