Edge ML: Quantization, Pruning, Distillation
Running ML on phones, IoT devices, and embedded hardware means making models 10-100x smaller. Three techniques do the heavy lifting.
The three techniques
Quantization reduces numerical precision (from 16-bit floats down to 8- or 4-bit). Pruning removes redundant weights. Distillation trains a smaller model to mimic a larger one. They attack different sources of bloat.
Quantization
The most cost-effective of the three. INT8 halves memory relative to FP16 with negligible accuracy loss; INT4/NF4 cuts it to a quarter with 1-3% loss. Covered in detail elsewhere; the workhorse of edge deployment.
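To make the mechanism concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy (the function names and the 4096x4096 layer are illustrative, not tied to any particular library):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8: map floats onto the integer range [-127, 127]."""
    w = weights.astype(np.float32)
    scale = np.abs(w).max() / 127.0          # one scale factor for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at compute time."""
    return q.astype(np.float32) * scale

# One 4096x4096 layer stored in FP16 (illustrative size).
w = np.random.randn(4096, 4096).astype(np.float16)
q, scale = quantize_int8(w)
err = np.abs(w.astype(np.float32) - dequantize_int8(q, scale)).mean()

print(f"FP16: {w.nbytes / 2**20:.0f} MiB  ->  INT8: {q.nbytes / 2**20:.0f} MiB")
print(f"mean absolute error: {err:.4f}")
```

Real deployments use per-channel or per-block scales (as NF4 does) to keep the error down, but the memory arithmetic is the same.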
Pruning
Removes weights below a magnitude threshold, or whole neurons that contribute little. Two flavors:
- Unstructured pruning: zero out individual weights. High compression but rarely faster (sparsity is hard to exploit on hardware).
- Structured pruning: remove entire heads, channels, or layers. Less compression but real speedup.
For LLMs, structured pruning at the head level is the practical choice: typically a 30-40% size reduction for 1-3% quality loss.
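A small NumPy sketch contrasting the two flavors (the magnitude threshold and the per-head norm score are simple illustrative heuristics; real pipelines score importance more carefully):

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_unstructured(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights; shape and matmul cost stay the same."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def prune_heads(w_o: np.ndarray, n_heads: int, keep: int) -> np.ndarray:
    """Drop whole attention heads, scored here by the L2 norm of each head's block."""
    head_dim = w_o.shape[0] // n_heads
    heads = w_o.reshape(n_heads, head_dim, -1)
    scores = np.linalg.norm(heads, axis=(1, 2))        # one importance score per head
    kept = np.sort(np.argsort(scores)[-keep:])         # keep the highest-scoring heads
    return heads[kept].reshape(keep * head_dim, -1)    # smaller dense matrix

w = rng.standard_normal((4096, 4096), dtype=np.float32)
sparse_w = prune_unstructured(w, sparsity=0.5)   # same shape, ~50% zeros
dense_w = prune_heads(w, n_heads=32, keep=24)    # 8 heads removed, 25% fewer rows
print(sparse_w.shape, float(np.mean(sparse_w == 0)), dense_w.shape)
```

The unstructured result keeps its shape (half the entries are zero, but the matmul cost is unchanged), while the structured result is a genuinely smaller dense matrix.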
Distillation
Train a small student model to match a large teacher's outputs. Surprisingly effective: a distilled 1B model often outperforms a 1B model trained from scratch by 2-5 percentage points. The teacher's soft probability distributions encode how plausible the wrong answers are, which teaches the student more than hard labels would.
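A minimal PyTorch sketch of the standard soft-target loss; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from any particular model:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4, vocabulary of 10.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```

The T-squared factor keeps the soft-target gradients on the same scale as the hard-label term as the temperature changes.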
Used heavily in models like Gemma-2B and Phi-3-mini. The 1-3B size class wouldn’t be useful without distillation.
Combining them
Production edge deployments often stack all three: distill a frontier model down to 1-3B parameters, prune attention heads, and quantize to 4-bit. The result is a 700-800 MB file that runs on a phone at 5-10 tokens/s, with capability that would have been frontier-grade 2-3 years earlier.
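A back-of-the-envelope check that the numbers add up, using the figures quoted above (the exact parameter count and overhead factor are illustrative assumptions):

```python
params_after_distill = 2.0e9      # distill the frontier teacher into a ~2B student
prune_reduction = 0.30            # structured head pruning: 30-40% size reduction (low end)
bits_per_weight = 4               # 4-bit quantization
overhead = 1.05                   # scales, higher-precision embeddings, file metadata

params = params_after_distill * (1 - prune_reduction)
size_mb = params * bits_per_weight / 8 * overhead / 1e6
print(f"~{size_mb:.0f} MB on disk")   # ~735 MB, inside the 700-800 MB range above
```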