AI & ML Advanced By Samson Tanimawo, PhD Published Apr 7, 2026 6 min read

On-Device LLMs: The 7B Sweet Spot

By 2026, a 7B-parameter quantised LLM runs comfortably on flagship phones and competently on laptops. The sweet spot for ‘local AI that works.’

Why 7B specifically

Models below 3B noticeably under-perform on instruction following and reasoning. Models above 13B don’t fit in mobile memory or run at usable speed. 7B is the largest size class that fits in ~4GB at 4-bit quantisation and runs at 5-15 tok/s on flagship hardware.

It’s also the size where capability meets cost: a 7B model handles 70-80% of typical chat and tool-use tasks adequately. The gap to frontier shrinks every six months.

The model lineup

Llama 3 8B / Llama 3.1 8B: the open-weight default. Strong instruction following, multilingual.
Mistral 7B / Mistral Nemo: alternative defaults. Strong reasoning per parameter.
Gemma 2-7B / Gemma 2-9B: Google’s open weights. Tightly distilled.
Phi-3 mini (3.8B): Microsoft’s data-curated approach. Punches above its weight.
Qwen 2.5 7B: strong on coding and Chinese.

Runtime stacks

llama.cpp: cross-platform, mature, the de facto standard.
Apple Core ML / MLX: Apple Silicon native. Best perf on Macs and iPhones.
ONNX Runtime: Microsoft’s cross-platform alternative.
MediaPipe / Gemini Nano: Google’s on-device path.

Where on-device wins

Privacy-critical (medical, legal, personal journaling).
Offline (transit, remote work, secure environments).
Cost-sensitive (no per-token API charges).
Latency-sensitive (no network round-trip).

The use cases compound. Apple’s, Google’s, and Microsoft’s 2026 product strategies all assume on-device 7B-class models become baseline.