The Soak Test That Catches Memory Leaks
Many memory leaks reach production because soak tests are too short to expose them. This piece covers the 72-hour test, the metrics to watch, and the leaks it has actually caught.
Why 72 hours
Memory leaks have time constants. Short tests pass and lie; long tests reveal the slow drift that ships to production unnoticed.
- 1-hour test. Passes almost regardless of whether a leak exists; the heap has not had time to drift.
- 8-hour test. Catches the fast-growing leaks; slow drifts still hide.
- 72-hour test. Catches most production-relevant leaks; the duration roughly matches a weekend in production, when a process runs unattended and slow drift accumulates.
- Diminishing returns. Beyond 72 hours, each extra hour catches little more; 72 hours is the cost-effective ceiling.
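The time-constant argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a constant leak rate and a fixed noise band in RSS, and treats a leak as detectable once the accumulated growth exceeds a multiple of that noise. The function names, the 50 MB noise band, the 2x margin, and the leak rates are all hypothetical.

```python
from typing import Optional, Sequence


def hours_to_detect(leak_mb_per_hour: float, noise_mb: float, margin: float = 2.0) -> float:
    """Hours until the accumulated leak exceeds `margin` times the noise band.

    Assumes a constant leak rate, which real leaks rarely have exactly.
    """
    if leak_mb_per_hour <= 0:
        return float("inf")  # no leak, nothing to detect
    return margin * noise_mb / leak_mb_per_hour


def shortest_catching_test(needed_hours: float,
                           durations: Sequence[int] = (1, 8, 72)) -> Optional[int]:
    """Return the shortest test duration that runs long enough, or None."""
    for duration in sorted(durations):
        if duration >= needed_hours:
            return duration
    return None


if __name__ == "__main__":
    # Hypothetical leak rates in MB/hour against a 50 MB noise band.
    for rate in (25.0, 2.0, 0.5):
        needed = hours_to_detect(rate, noise_mb=50.0)
        print(rate, needed, shortest_catching_test(needed))
```

Under these assumptions a 25 MB/hour leak needs 4 hours to clear the noise, so only the 8-hour and 72-hour tests see it; a 2 MB/hour leak needs 50 hours, so only the 72-hour test does. A 0.5 MB/hour leak would need 200 hours, which is where the diminishing-returns trade-off applies.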
What to watch
Four signals tell you whether the run is healthy. Each one catches a different leak shape; missing any of them lets that shape through.
- RSS memory. Should asymptote, not grow linearly; linear growth is the classic leak signature.
- Open file descriptors. Should stabilise; a steadily growing FD count is a leak just as serious as growing memory.
- Goroutine / thread count. Unbounded growth in Go goroutines or JVM threads is a leak; track the count.
- GC pause time. Should be steady; growing pauses indicate heap pressure that may have a leak underneath.
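One way to turn "should asymptote, not grow linearly" into a pass/fail check is to fit a line to the second half of the run and ask whether the fitted trend still adds meaningful growth. The following is a minimal sketch under stated assumptions, not a definitive detector: the function names, the 2% tolerance, and the sample series are hypothetical, and the samples are assumed to have been collected already as (hours, value) pairs. The same check could be applied to FD count, goroutine or thread count, or GC pause time.

```python
from typing import Sequence


def tail_slope(hours: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope (units per hour) over the second half of the run."""
    if len(hours) != len(values) or len(hours) < 4:
        raise ValueError("need at least 4 paired samples")
    half = len(hours) // 2
    xs, ys = hours[half:], values[half:]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    if den == 0:
        raise ValueError("sample times must not all be identical")
    return num / den


def still_growing(hours: Sequence[float], values: Sequence[float],
                  tolerance: float = 0.02) -> bool:
    """True if the tail trend adds more than `tolerance` of the tail mean.

    The 2% default is an assumed threshold; tune it to the noise of the metric.
    """
    half = len(hours) // 2
    tail_span = hours[-1] - hours[half]
    tail_mean = sum(values[half:]) / len(values[half:])
    growth = tail_slope(hours, values) * tail_span
    return growth > tolerance * tail_mean


# Hypothetical 72-hour runs sampled every 6 hours (RSS in MB).
hours = [float(h) for h in range(0, 73, 6)]
plateau = [400.0 + 100.0 * (1.0 - 2.0 ** (-h / 6.0)) for h in hours]  # levels off
linear = [400.0 + 3.0 * h for h in hours]                             # 3 MB/hour leak

if __name__ == "__main__":
    print(still_growing(hours, plateau), still_growing(hours, linear))
```

Fitting only the second half of the run keeps warm-up growth from being mistaken for a leak, at the cost of needing a run long enough for the tail to be informative.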
Make it part of the release
The soak test only matters if the release blocks on it. Without enforcement, its results become a report nobody reads.
- Release-blocking. Soak failure blocks the release; no override without a written waiver and named approver.
- Run on RC. Soak runs against the release candidate, not main; the artifact going to production is what gets tested.
- Cost realism. 72 hours of compute per release is cheap compared to one production memory leak.
- Trend tracking. Trend the per-release soak metrics over time; gradual drift in the baseline is itself a signal.
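The gate and the trend check above can be sketched in a few lines. This is a hypothetical illustration, not a real CI integration: `Waiver`, `release_allowed`, and `baseline_drift` are invented names, and the plateau figures are made-up values standing in for the settled RSS recorded by each release's soak run.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Waiver:
    reason: str    # the written justification
    approver: str  # the named approver


def release_allowed(soak_passed: bool, waiver: Optional[Waiver] = None) -> bool:
    """A soak failure blocks the release unless a complete waiver is attached."""
    if soak_passed:
        return True
    if waiver is None:
        return False
    # Both fields must be non-empty for the waiver to count.
    return bool(waiver.reason.strip()) and bool(waiver.approver.strip())


def baseline_drift(plateaus_mb: Sequence[float]) -> float:
    """Fractional change in plateau RSS from the first release to the latest."""
    if len(plateaus_mb) < 2 or plateaus_mb[0] <= 0:
        raise ValueError("need at least two positive plateau values")
    return (plateaus_mb[-1] - plateaus_mb[0]) / plateaus_mb[0]


if __name__ == "__main__":
    print(release_allowed(False))
    print(release_allowed(False, Waiver("known cache warm-up", "alice")))
    # Hypothetical plateau RSS (MB) from four consecutive release candidates.
    print(baseline_drift([500.0, 510.0, 525.0, 550.0]))
```

Each release in the example passes its own soak run, yet the baseline has moved by a tenth across four releases, which is the kind of gradual drift the trend is meant to surface.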