Flaky Test Replay
[Figure: a captured flake, reproduced locally]
Capture
The reason flaky tests are hard to fix is that the failure is gone by the time you go looking for it. Hit retry, the test passes, and you move on. The bug is still there; you just lost the only useful evidence. The fix is to capture everything you would need to reproduce the failure at the moment the failure happens, not after.
What to capture on every failed run (a minimal capture sketch follows the list):
- Process state: Memory, environment variables, working directory, file descriptors, open sockets, pid, and parent pid. A snapshot taken at failure tells you whether the test was racing on a shared resource, leaking a file handle, or seeing an environment variable mutation that no other test exhibits.
- External state: Database rows the test is reading or writing, cache entries it depends on, queue depth, mock service state, real service health (for integration tests). For a flaky test that only fails when the database has a specific row count, the row count at failure is the most important piece of data in the world, and almost nobody captures it.
- Logs across the boundary: Test process logs, system logs, dependent service logs, container runtime logs. The log line that reveals the race condition is rarely in the test code itself. It is in the service the test was talking to.
- Wall-clock and monotonic timing: Time-of-day and the duration of every assertion. Many flakes are timing flakes that only fail when a particular operation crosses a threshold. The duration trace makes those visible.
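Concretely, the capture hook can be one small function that fires on failure. Below is a minimal sketch in Python using only the standard library; the function names, the JSON layout, and the shape of `external_state` are illustrative, and the external half is whatever your fixtures can actually report:

```python
import json
import os
import sys
import time

def capture_process_state() -> dict:
    """Snapshot the process facts listed above at the moment of failure."""
    fd_dir = "/proc/self/fd"  # Linux-only; other platforms need a different probe
    return {
        "pid": os.getpid(),
        "ppid": os.getppid(),
        "cwd": os.getcwd(),
        "argv": sys.argv,
        "env": dict(os.environ),        # env var mutations show up here
        "open_fds": sorted(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else [],
        "wall_clock": time.time(),      # time-of-day at failure
        "monotonic": time.monotonic(),  # basis for duration traces
    }

def on_test_failure(test_id: str, run_id: str, external_state: dict) -> str:
    """Write one capture file per failed run. external_state carries whatever
    the fixtures can report: row counts, cache keys, queue depth, and so on."""
    bundle = {
        "test_id": test_id,
        "run_id": run_id,
        "process": capture_process_state(),
        "external": external_state,
    }
    path = f"{test_id}-{run_id}.capture.json"
    with open(path, "w") as f:
        json.dump(bundle, f, indent=2, default=str)
    return path
```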
Bundle all of this into a single artifact named for the test ID and the run ID, attach it to the failed CI run, and retain it for at least 30 days. The capture itself should add no more than a couple of seconds to the test run. The cost is small. The value when you actually need it is enormous.
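Packaging can be a one-liner at the call site. A sketch, again with illustrative names, that rolls the capture JSON and the collected logs into a single gzipped tarball named for the test ID and run ID:

```python
import tarfile
from pathlib import Path

def package_artifact(test_id: str, run_id: str, files: list[str]) -> Path:
    """Bundle the capture file and logs into one artifact for the CI run."""
    artifact = Path(f"{test_id}-{run_id}.tar.gz")
    with tarfile.open(artifact, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=Path(f).name)  # flatten paths inside the bundle
    return artifact
```

How the artifact gets attached and retained is CI-specific (an upload-artifact step on GitHub Actions, an `artifacts:` block on GitLab CI), so the sketch stops at producing the file.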
Replay
Once the capture exists, the next move is to run the test against the captured state on a developer machine and watch it fail in slow motion. This is where the bug actually gets understood, because now you have a reproducer instead of a mystery.
- Replay should run locally: The whole point is to get out of CI and into a debugger. Replay tooling restores the captured state into a sandbox on the developer's machine and runs the test with the original entry point against that sandbox (a minimal harness is sketched after this list). The developer can step through, attach a debugger, add print statements, change one variable at a time.
- Replay should be deterministic on the captured state: The same capture, run twice, should produce the same failure. If it does not, you have not captured enough state and the capture layer needs to add what is missing. A non-deterministic replay is worse than no replay.
- Iterate on the fix in the replay loop, not in CI: Each "what if I change this" experiment takes seconds in replay versus a multi-minute CI cycle. The fix gets found 10x faster because the inner loop is fast enough to actually iterate.
- Replay is mechanical, not creative: No guessing at causes, no "let me try adding a sleep". The captured state tells you what was true at the moment of failure. The fix follows from that, not from intuition.
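A replay harness can stay small when the capture did its job. The sketch below restores only the environment and working directory from the capture above and re-runs the original entry point; restoring external state (database rows, cache entries) would hook in before the run, and comparing return codes plus stdout tails is a deliberately crude failure signature:

```python
import json
import subprocess

def replay(capture_path: str) -> subprocess.CompletedProcess:
    """Re-run the captured entry point inside the captured environment.
    Assumes the captured argv[0] is directly executable on this machine."""
    with open(capture_path) as f:
        proc = json.load(f)["process"]
    # External-state restore (load DB rows, seed caches) would go here.
    return subprocess.run(
        proc["argv"],     # the original entry point, e.g. a pytest invocation
        env=proc["env"],  # the environment exactly as captured
        cwd=proc["cwd"],
        capture_output=True,
        text=True,
    )

def assert_deterministic(capture_path: str) -> None:
    """Same capture, run twice, must fail the same way; anything else means
    the capture layer is missing state and needs to be extended."""
    first, second = replay(capture_path), replay(capture_path)
    assert first.returncode != 0, "capture did not reproduce the failure"

    def signature(r: subprocess.CompletedProcess) -> tuple:
        return (r.returncode, r.stdout[-2000:])

    assert signature(first) == signature(second), \
        "non-deterministic replay: capture is incomplete"
```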
Replay turns flaky-test debugging from a guessing game into a mechanical process. The first time it pays off, the team becomes its loudest advocate.
Validate
The last step is the one most teams skip: prove the fix actually works against the original failure, not just against your hopeful new test case.
- Re-run the captured state with the fix: The same capture that originally failed should now pass with the fixed code (the validation loop is sketched after this list). This is the only validation that matters. A new test you wrote that demonstrates your hypothesis is not the same as showing the original failure is gone.
- Re-run the capture against pre-fix code: Run the original capture against the previous (broken) version of the code and confirm it fails. If it does not, your capture is incomplete and your fix may not actually be a fix.
- Add the capture as a regression test: Once validated, the capture (or a sanitized version of its key state) becomes a permanent regression test. The next time someone reintroduces the bug, the test catches it on the original failure mode, not on a guess.
- Confidence is high: Because the validation is mechanical (the same input that produced the failure no longer produces it), you can say "fixed" without the usual hedging. That is the difference between flaky-test triage and flaky-test resolution.
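A sketch of that validation loop, assuming the `replay` harness above lives in a `replay.py` that exits with the test's return code, and using throwaway git worktrees to run the same capture against both versions of the code; the refs, paths, and the regression-test name are all illustrative:

```python
import subprocess
import sys

def replay_at(git_ref: str, capture_path: str) -> int:
    """Replay the capture against the code as it was at git_ref, using a
    disposable worktree so the main checkout is untouched. Pass an absolute
    capture_path, since the working directory changes."""
    wt = f"/tmp/replay-{git_ref.replace('/', '-')}"
    subprocess.run(["git", "worktree", "add", "--detach", wt, git_ref], check=True)
    try:
        result = subprocess.run(
            [sys.executable, "replay.py", capture_path],  # assumed harness filename
            cwd=wt, capture_output=True,
        )
        return result.returncode
    finally:
        subprocess.run(["git", "worktree", "remove", "--force", wt], check=True)

def validate_fix(capture_path: str, broken_ref: str, fixed_ref: str) -> None:
    # The capture must still fail pre-fix; if not, the capture is incomplete.
    assert replay_at(broken_ref, capture_path) != 0, "capture no longer fails on broken code"
    # And the same capture must pass with the fix: the only validation that matters.
    assert replay_at(fixed_ref, capture_path) == 0, "fix does not resolve the original failure"

def test_regression_checkout_timeout():
    """The validated capture, promoted to a permanent regression test."""
    assert replay_at("HEAD", "/captures/checkout-timeout-8841.capture.json") == 0
```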
Capture, replay, validate is the difference between a team that lives with flaky tests forever and a team that retires them. Nova AI Ops captures process and external state on every CI failure, packages a replay artifact you can run locally, and tracks the cohort of validated fixes so you can watch the flake rate drop quarter over quarter instead of guessing whether it is improving.