Memory Leaks: Finding and Fixing
Memory leaks accumulate silently; investigations are painful. Process below cuts time-to-fix substantially.
Recognizing leaks
Memory leaks are a specific shape of growth, not just any rising memory line. Distinguishing them from cache or load growth is the first investigative step.
- Definition. Memory grows without bound; eventually OOM kills the process; restart cycle becomes the symptom.
- Cache growth differs. Caches plateau at the configured limit; leaks do not.
- Load growth differs. Load-driven memory tracks traffic; leaks rise even when traffic is flat.
- Restart-driven evidence. Service stable when restarted, OOMs after N hours; the time constant is the leak signature.
Four-symptom checklist
- 1. Restart cycle, service restarts every N hours.
- 2. Slow leak in dashboard.
- 3. GC frequency rising.
- 4. Latency degrading slowly over hours.
Heap-dump analysis
Heap analysis is language-specific. Each runtime has the canonical tool; learn yours before the incident, not during.
- JVM. Heap dump via
jmap; analysis with Eclipse MAT; the dominator tree shows the retainers. - Go. Heap profile via
pprof;go tool pprofvisualises allocation sites. - Python.
tracemallocfor allocation tracking;objgraphfor reference cycles. - Node.
--inspectplus Chrome DevTools heap snapshots; useful for long-lived servers.
Safer alternatives
Stop-the-world heap dumps in production are dangerous. Three alternatives surface leaks earlier and at lower cost.
- Continuous profiling. Always-on profilers (Pyroscope, Polar Signals) trend allocation over weeks; spot drift before incidents.
- Allocation sampling. Per-request memory accounting flags growth in specific code paths.
- CI memory regression. Fail the PR if the test process retains more than X memory; cheap insurance.
- Soak tests. 72-hour soak before release surfaces slow leaks before they ship to production.
Antipatterns
- Diagnosing in production with stop-the-world dumps. Outage.
- Restarting as the ‘fix.’ Treats symptom.
- Memory leaks blamed on framework. Usually app code.
What to do this week
Three moves. (1) Apply this pattern to your slowest production endpoint. (2) Measure p99 before/after. (3) Document the win and ship the runbook so the team can reproduce.