Memory Fragmentation
Cause of OOM.
Overview
Memory fragmentation produces apparent OOM even when the kernel reports free pages. The allocator’s heap is fragmented; large allocations fail despite available memory. Recognising fragmentation as a distinct failure mode is what stops the “just give it more memory” reflex.
- Cause of OOM. Fragmented heap fails large allocations even with free pages. RSS alone does not surface the problem.
- Allocator matters. glibc malloc, jemalloc, tcmalloc all behave differently. Workload pattern selects the right one.
- Long-lived processes. Fragmentation accumulates over time. Short-lived processes rarely see it; long-running ones do.
- Workload shape plus periodic restart. Mixed allocation sizes fragment more than uniform ones; a periodic restart resets the heap and is sometimes the simplest fix.
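The failure mode in the list above can be reproduced with a toy model. The sketch below is not how glibc malloc, jemalloc, or tcmalloc work internally; it assumes a hypothetical heap of 64 equal slots and a request that needs contiguous slots. It only illustrates why total free memory and the largest free block are different numbers.

```python
def free_runs(heap):
    """Return the length of every contiguous run of free slots."""
    runs, current = [], 0
    for used in heap:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def can_allocate(heap, size):
    """A request succeeds only if one free run is at least `size` slots long."""
    return any(run >= size for run in free_runs(heap))


# Hypothetical 64-slot heap, True = slot in use.
# Fill it with one-slot objects, then free every other object.
heap = [True] * 64
for i in range(0, 64, 2):
    heap[i] = False

free_total = heap.count(False)      # half of the heap is free
largest_run = max(free_runs(heap))  # but no two free slots are adjacent

print(f"free slots: {free_total}, largest contiguous run: {largest_run}")
print("8-slot request fits:", can_allocate(heap, 8))
```

Half of the toy heap is free, yet an 8-slot request fails because no free run is longer than one slot. A real allocator has more options (size classes, mmap for large requests, returning pages to the kernel), so treat this as intuition rather than a model of any specific allocator.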
The approach
Three habits keep memory fragmentation under control: choose the allocator deliberately, monitor fragmentation as a standing signal, and restart periodically when fragmentation is the dominant failure mode.
- Choose allocator. jemalloc fits many long-running workloads; tcmalloc shines on multi-threaded ones. Default glibc malloc is fine for short-lived processes.
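- Monitor fragmentation. /proc/PID/maps shows how the heap and anonymous mappings grow; glibc's M_MMAP_THRESHOLD setting determines which large allocations are served by mmap instead of the heap, so it shapes what those mappings look like. Watch the trend over the process lifetime.
- Restart periodically. A daily or weekly restart resets fragmentation cleanly. Sometimes it is simpler than tuning the allocator.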
- Pre-allocate buffers plus documented policy. Buffer reuse reduces fragmentation; the per-service allocator choice and restart policy live in the runbook.
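The monitoring habit above can be sketched as a small sampler. This is a minimal illustration, not a standard tool: it assumes the Linux /proc/PID/smaps text format (the per-mapping variant of /proc/PID/maps that includes Rss lines), and the names heap_rss_kib, fragmentation_ratio, and live_kib are hypothetical. live_kib stands for a "bytes currently in use" figure that the application itself would have to report.

```python
import re

# Mapping header line: address range, perms, offset, device, inode, optional name.
_HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\d+\s*(.*)$")


def heap_rss_kib(smaps_text):
    """Sum Rss (KiB) over [heap] and unnamed anonymous mappings.

    Rough proxy only: unnamed mappings can also hold memory that did not
    come from the allocator.
    """
    total = 0
    counting = False
    for line in smaps_text.splitlines():
        header = _HEADER.match(line)
        if header:
            name = header.group(1).strip()
            counting = name in ("", "[heap]")
        elif counting and line.startswith("Rss:"):
            total += int(line.split()[1])
    return total


def fragmentation_ratio(rss_kib, live_kib):
    """Resident heap memory per KiB the application says it is using.

    live_kib is an assumed application-level metric; a ratio that keeps
    rising while live_kib stays flat is the trend worth alerting on.
    """
    if live_kib <= 0:
        raise ValueError("live_kib must be positive")
    return rss_kib / live_kib
```

On a Linux host, the text passed to heap_rss_kib would come from reading /proc/PID/smaps for the target process at a fixed interval. A single sample says little; the useful signal is how the ratio moves over days of uptime.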
Why this compounds
Each correctly tuned process stays stable for the lifetime of the deployment. The team learns to recognise fragmentation as a class of failure rather than treating every OOM as a memory shortage.
- Reduced OOM incidents. Right allocator and restart policy shrink the OOM-incident class. Memory limits stop being a recurring outage source.
- Better resource utilisation. Less waste from fragmentation means smaller instance types become viable.
- Engineering quality. Buffer reuse produces cleaner code. Performance work and code clarity align.
- Year-one investment, year-two habit. The first allocator change is an investment. By the third service, the team picks the allocator at design time.