On-Device LLMs: The 7B Sweet Spot
By 2026, a 7B-parameter quantised LLM runs comfortably on laptops and competently on flagship phones. That makes 7B the sweet spot for ‘local AI that works.’
Why 7B specifically
Models below 3B noticeably underperform on instruction following and reasoning, while models above 13B either don’t fit in mobile memory or run too slowly to be usable. 7B is the largest size class that fits in roughly 4GB at 4-bit quantisation while still decoding at 5-15 tok/s on flagship hardware.
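Both numbers follow from simple arithmetic: weight storage is parameters times bits per weight, and decode speed is roughly memory bandwidth divided by model size, since generating each token reads every weight once. A back-of-envelope sketch (the bandwidth figures are illustrative assumptions, not measurements):

```python
def model_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB.

    Ignores KV cache and runtime overhead; 4.5 bits/weight approximates a
    4-bit format once per-block scales are included.
    """
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

def decode_tok_s(weights_gib: float, mem_bw_gib_s: float) -> float:
    """Decode is memory-bandwidth bound: each token streams all weights."""
    return mem_bw_gib_s / weights_gib

w = model_gib(7, 4.5)
print(f"7B @ ~4-bit: {w:.1f} GiB")                       # ≈ 3.7 GiB
print(f"phone  (~50 GiB/s assumed): {decode_tok_s(w, 50):.0f} tok/s")
print(f"laptop (~100 GiB/s assumed): {decode_tok_s(w, 100):.0f} tok/s")
```

With those assumed bandwidths the estimate lands at roughly 14 tok/s on a phone and 27 tok/s on a laptop, consistent with the 5-15 tok/s range quoted above once thermal throttling and smaller bandwidth budgets are factored in.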
It’s also the size where capability meets cost: a 7B model handles 70-80% of typical chat and tool-use tasks adequately, and the gap to frontier models shrinks every six months.
The model lineup
- Llama 3 8B / Llama 3.1 8B: the open-weight default. Strong instruction following, multilingual.
- Mistral 7B / Mistral Nemo: alternative defaults. Strong reasoning per parameter.
- Gemma 7B / Gemma 2 9B: Google’s open weights. The 9B is distilled from the larger Gemma 2 27B.
- Phi-3 mini (3.8B): Microsoft’s data-curated approach. Punches above its weight.
- Qwen 2.5 7B: strong on coding and Chinese.
Runtime stacks
- llama.cpp: cross-platform, mature, the de facto standard for running GGUF-quantised models.
- Apple Core ML / MLX: native to Apple Silicon. Best performance on Macs and iPhones.
- ONNX Runtime: Microsoft’s cross-platform alternative.
- MediaPipe / Gemini Nano: Google’s on-device path.
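All of these runtimes lean on block quantisation: weights are stored in small groups, each with its own scale, so a 16-bit model shrinks to ~4 bits per weight with modest accuracy loss. A simplified sketch of symmetric 4-bit block quantisation, illustrating the idea rather than any runtime's exact format (llama.cpp's Q4_K, for instance, adds further per-block structure):

```python
import random

def quantise_block(block):
    """Symmetric 4-bit quantisation: one float scale per block,
    integer codes in [-8, 7] (here [-7, 7] by construction of the scale)."""
    scale = max(abs(w) for w in block) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return scale, q

def dequantise_block(scale, q):
    """Reconstruct approximate weights from scale and integer codes."""
    return [scale * v for v in q]

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(32)]  # one 32-weight block
scale, q = quantise_block(weights)
recon = dequantise_block(scale, q)
err = max(abs(a - b) for a, b in zip(weights, recon))
print(f"scale: {scale:.5f}, max abs error: {err:.5f}")
```

Rounding to the nearest code bounds the per-weight error at half the block scale, which is why outlier weights (which inflate the scale) are the main enemy of low-bit quantisation and why production formats use small blocks.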
Where on-device wins
- Privacy-critical (medical, legal, personal journaling).
- Offline (transit, remote work, secure environments).
- Cost-sensitive (no per-token API charges).
- Latency-sensitive (no network round-trip).
The use cases compound. Apple’s, Google’s, and Microsoft’s 2026 product strategies all assume that on-device 7B-class models become the baseline.