CI Test Isolation
Tests must be isolated from each other, in both the data they touch and the environment they run in; most CI flake traces back to a missing isolation boundary.
Data
The single biggest source of test flake in any CI system is data pollution between tests. Test A inserts a row, test B reads it, test C truncates the table, test D runs in a different order than yesterday and now sees state it did not write. Each test individually looks correct. The suite, run in any order, fails intermittently. The fix is non-negotiable test data isolation.
What proper data isolation looks like:
- Each test gets fresh data: Every test starts with a known, controlled state and ends having cleaned up. No reliance on what previous tests left behind. The test reads its own fixture, writes its own rows, and tears them down or runs inside a transaction that rolls back.
- Per-test database or schema: The strongest isolation is a fresh database per test, or at least a fresh schema. Modern test runners spin these up in seconds using lightweight engines (SQLite in memory, Postgres in a container); see the fresh-database sketch after this list. The cost is real but bounded; the payoff in test reliability is enormous.
- Transactional rollback when fresh-DB is too expensive: Run each test inside a transaction that rolls back at the end (see the rollback sketch below). This works for most read-write tests against relational databases without requiring a fresh DB per test. The trade-off is that tests cannot exercise commit-time behavior or interactions across transactions.
- Unique IDs per test: Where shared databases are unavoidable, use namespaced IDs that include the test ID or a UUID (see the namespaced-ID sketch below). Test A's user ID 1 cannot collide with test B's user ID 1 because they were created with different prefixes. The mechanical separation prevents the most common pollution failure.
- Shared read-only fixtures only: The only state tests can share is read-only reference data (currency tables, timezone lists, hardcoded enums). Anything that any test mutates is per-test by construction.
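As a concrete illustration of the per-test database approach, here is a minimal sketch using pytest and the standard-library sqlite3 module; the schema and test are illustrative, not taken from any particular suite.

```python
import sqlite3

import pytest


@pytest.fixture
def db():
    # ":memory:" gives every test its own empty database; nothing survives
    # the test, so there is no leftover state to clean up or collide with.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    yield conn
    conn.close()


def test_insert_user(db):
    db.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    assert count == 1  # only the row this test wrote is visible
```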
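Where a fresh database per test is too expensive, the transactional-rollback pattern looks roughly like this. The sketch assumes SQLAlchemy and a shared test database; the connection URL is a placeholder.

```python
import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder URL for a shared test database.
engine = create_engine("postgresql://localhost/test_db")


@pytest.fixture
def session():
    connection = engine.connect()
    transaction = connection.begin()    # outer transaction owned by the fixture
    session = sessionmaker(bind=connection)()
    yield session                       # the test does its reads and writes here
    session.close()
    transaction.rollback()              # undo everything the test wrote
    connection.close()
```

As the trade-off above notes, code that issues its own commits needs extra care (savepoints) or falls back to the fresh-database approach.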
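And for suites stuck on a shared database, a namespaced-ID fixture is a few lines. `request.node.nodeid` is pytest's identifier for the running test; the email format is just an example.

```python
import uuid

import pytest


@pytest.fixture
def test_prefix(request):
    # e.g. "tests/test_users.py::test_signup-3f2a1c9e": unique per test and
    # per run, so retries cannot collide with earlier rows either.
    return f"{request.node.nodeid}-{uuid.uuid4().hex[:8]}"


def test_signup(test_prefix):
    email = f"{test_prefix}@example.test"
    # create_user(email=email) would go here; no other test can create a
    # user with the same email, even on a shared database.
    assert email.endswith("@example.test")
```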
The discipline sounds excessive until the team experiences a flaky CI suite. Then it becomes the cheapest investment they ever made.
Environment
Data is the easiest source of test pollution to fix. Environment is the harder one. Tests can leak through shared filesystems, shared sockets, shared memory, shared timing, shared external services. The strongest answer is hard environment isolation per test.
- Per-test container or namespace: Each test runs in its own container, its own Linux namespace, or its own process tree. Filesystem, network, and IPC are all isolated by default. The test cannot accidentally write to a global path or open a port another test is listening on.
- Network isolation: Tests should not share localhost ports. Either bind to ephemeral ports per test (see the port sketch after this list), or run each test in a network namespace where its localhost is private. The "address in use" error in CI is almost always a missing isolation boundary.
- Filesystem isolation: Use per-test temporary directories. Tests must not write to a shared /tmp path; give each test its own working directory, mode 700, torn down after the test.
- Time-based isolation: Tests that depend on the wall clock are fragile. Inject the clock as a dependency so the test can move it forward, freeze it, or reset it (see the clock sketch below). Real wall-clock dependencies cause time-of-day flakes that are nearly impossible to debug.
- External service mocking, by default: Tests do not call real third-party services. They call mocks or contract-tested fakes (see the patching sketch below). The only place real third-party calls happen is a small set of dedicated integration tests that are explicitly allowed to be slower and flakier.
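A minimal sketch of the ephemeral-port approach, using only the standard library; the fixture hands the test a port the OS just confirmed was free.

```python
import socket

import pytest


@pytest.fixture
def free_port():
    # Binding to port 0 asks the OS for any unused port, so tests and
    # parallel CI workers never collide on a hard-coded port number.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()  # tiny race window before the server binds it; acceptable in tests
    return port


def test_server_starts(free_port):
    # start_server(port=free_port) would go here; the point is that no
    # other test in the suite is listening on this port.
    assert 1024 <= free_port <= 65535
```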
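Clock injection can be as small as this sketch; `Clock`, `FrozenClock`, and `active_sessions` are illustrative names, not an existing API.

```python
import datetime


class Clock:
    """Real clock used by production code."""

    def now(self) -> datetime.datetime:
        return datetime.datetime.now(datetime.timezone.utc)


class FrozenClock(Clock):
    """Test clock that only moves when the test says so."""

    def __init__(self, start: datetime.datetime):
        self._now = start

    def now(self) -> datetime.datetime:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += datetime.timedelta(seconds=seconds)


def active_sessions(sessions, clock: Clock, ttl_seconds: int):
    cutoff = clock.now() - datetime.timedelta(seconds=ttl_seconds)
    return [s for s in sessions if s["last_seen"] >= cutoff]


def test_sessions_expire_after_ttl():
    clock = FrozenClock(datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc))
    sessions = [{"last_seen": clock.now()}]
    clock.advance(3601)  # one hour and one second later, deterministically
    assert active_sessions(sessions, clock, ttl_seconds=3600) == []
```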
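For the mocking default, the standard-library `unittest.mock` covers most cases; `charge_card` and `place_order` below are hypothetical stand-ins for a real third-party client.

```python
from unittest import mock


def charge_card(amount_cents: int) -> str:
    # Stand-in for a real third-party API call.
    raise RuntimeError("real network call; tests must never reach this")


def place_order(amount_cents: int) -> str:
    return charge_card(amount_cents)


def test_place_order_never_hits_the_real_api():
    # Patch the client by import path; the test fails loudly if anything
    # slips past the mock and reaches the RuntimeError above.
    with mock.patch(f"{__name__}.charge_card", return_value="ch_test_123") as fake:
        assert place_order(500) == "ch_test_123"
        fake.assert_called_once_with(500)
```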
The investment in environment isolation looks like overkill until the team tries to debug a flaky test that only fails when test 47 in the suite happens to coincide with test 412 in another worker. At that point, the cost of hard isolation becomes obviously cheaper than the cost of the flake.
Flakes
The reason we go through all this trouble is that most "flaky tests" are not flaky tests. They are real tests catching real isolation failures, and the team is treating the symptom by retrying instead of fixing the cause. Investment in isolation pays back as a steep drop in the ambient flake rate.
- Most flakes come from poor isolation: Order dependencies, leftover state, time-of-day races, port collisions, shared mocks. Each one is a real bug; each one is hidden by the team retrying the failed run. The right response to a flake is investigation, not retry.
- Track flake rate per test, not just per suite: A specific test that fails 15% of the time is probably an isolation bug, and the fix is concentrated. A suite that fails 5% of the time spread across 200 different tests is a structural isolation problem and needs a system-level response. A small report over archived results is enough to start (see the flake-rate sketch after this list).
- Random test order in CI: Run tests in random order on every CI run (see the ordering sketch below). Order-dependent flakes surface immediately because they no longer hide behind the consistent default order. The first 20 random orders catch most of the offenders.
- Quarantine, then fix, then return: A flaky test gets pulled from the main run into a quarantine bucket within 24 hours of detection. The team has a deadline to fix it (one sprint, max). After the fix, it returns to the main run. Tests that linger in quarantine past the deadline get deleted, not silently kept around.
- Investment compounds: Every flake closed by a real fix prevents the future flakes that would have hit the same isolation bug. Good isolation is the gift that keeps giving across years and across new tests.
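Randomized ordering does not need much machinery; a sketch of a conftest.py hook is below. (The pytest-randomly plugin does the same with more polish.) The TEST_ORDER_SEED variable name is an assumption, not a standard.

```python
# conftest.py
import os
import random


def pytest_collection_modifyitems(session, config, items):
    # Shuffle the collected tests; print the seed so a failing order can be
    # reproduced exactly by exporting TEST_ORDER_SEED before the next run.
    seed = int(os.environ.get("TEST_ORDER_SEED", random.randrange(2**32)))
    print(f"\ntest order seed: {seed}")
    random.Random(seed).shuffle(items)
```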
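Per-test flake tracking can start as a small report over archived CI results. This sketch assumes one JUnit-style XML report per run is kept under reports/ (path and the 2% threshold are assumptions).

```python
import glob
from collections import Counter
from xml.etree import ElementTree

runs, failures = Counter(), Counter()
for path in glob.glob("reports/**/*.xml", recursive=True):
    for case in ElementTree.parse(path).iter("testcase"):
        name = f"{case.get('classname')}::{case.get('name')}"
        runs[name] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failures[name] += 1

# Print the quarantine candidates: individual tests failing on green code.
for name, total in sorted(runs.items(), key=lambda kv: -failures[kv[0]] / kv[1]):
    rate = failures[name] / total
    if rate > 0.02:
        print(f"{rate:6.1%}  {name}  ({failures[name]}/{total} runs)")
```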
The teams that take isolation seriously have CI suites that pass 99%+ of the time on green code. The teams that retry around the problem have CI suites that pass 80% of the time on green code, which is functionally indistinguishable from broken. Nova AI Ops tracks flake rate per test, automatically identifies isolation-related failures (order sensitivity, port collisions, shared state), and quarantines flaky tests so the suite signal stays meaningful.