Edge ML: Quantization, Pruning, Distillation
Running ML on phones, IoT devices, and embedded hardware means making models 10-100x smaller. Three techniques do the heavy lifting.
The three techniques
Quantization reduces numerical precision (from 16-bit floats down to 8- or 4-bit). Pruning removes redundant weights. Distillation trains a smaller model to mimic a larger one. They attack different sources of bloat.
Quantization
The most cost-effective of the three. INT8 halves memory relative to FP16 with negligible accuracy loss; INT4/NF4 cuts it to a quarter with 1-3% loss. Covered in detail elsewhere; the workhorse of edge deployment.
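To make the mechanism concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy (the function names and the 4096x4096 layer are illustrative, not tied to any particular library):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8: map floats onto the integer range [-127, 127]."""
    w = weights.astype(np.float32)
    scale = np.abs(w).max() / 127.0          # one scale factor for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at compute time."""
    return q.astype(np.float32) * scale

# One 4096x4096 layer stored in FP16 (illustrative size).
w = np.random.randn(4096, 4096).astype(np.float16)
q, scale = quantize_int8(w)
err = np.abs(w.astype(np.float32) - dequantize_int8(q, scale)).mean()

print(f"FP16: {w.nbytes / 2**20:.0f} MiB  ->  INT8: {q.nbytes / 2**20:.0f} MiB")
print(f"mean absolute error: {err:.4f}")
```

Real deployments use per-channel or per-block scales (as NF4 does) to keep the error down, but the memory arithmetic is the same.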
Pruning
Removes weights below a magnitude threshold, or whole neurons that contribute little. Two flavors:
- Unstructured pruning: zero out individual weights. High compression but rarely faster (sparsity is hard to exploit on hardware).
- Structured pruning: remove entire heads, channels, or layers. Less compression but real speedup.
For LLMs, structured pruning at the head level is the practical choice: typically a 30-40% size reduction for 1-3% quality loss.
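A small NumPy sketch contrasting the two flavors (the magnitude threshold and the per-head norm score are simple illustrative heuristics; real pipelines score importance more carefully):

```python
import numpy as np

rng = np.random.default_rng(0)

def prune_unstructured(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights; shape and matmul cost stay the same."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def prune_heads(w_o: np.ndarray, n_heads: int, keep: int) -> np.ndarray:
    """Drop whole attention heads, scored here by the L2 norm of each head's block."""
    head_dim = w_o.shape[0] // n_heads
    heads = w_o.reshape(n_heads, head_dim, -1)
    scores = np.linalg.norm(heads, axis=(1, 2))        # one importance score per head
    kept = np.sort(np.argsort(scores)[-keep:])         # keep the highest-scoring heads
    return heads[kept].reshape(keep * head_dim, -1)    # smaller dense matrix

w = rng.standard_normal((4096, 4096), dtype=np.float32)
sparse_w = prune_unstructured(w, sparsity=0.5)   # same shape, ~50% zeros
dense_w = prune_heads(w, n_heads=32, keep=24)    # 8 heads removed, 25% fewer rows
print(sparse_w.shape, float(np.mean(sparse_w == 0)), dense_w.shape)
```

The unstructured result keeps its shape (half the entries are zero, but the matmul cost is unchanged), while the structured result is a genuinely smaller dense matrix.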
Distillation
Train a small student model to match a large teacher's outputs. Surprisingly effective: a distilled 1B model often outperforms a 1B model trained from scratch by 2-5 percentage points. The teacher's soft probability distributions encode how plausible the wrong answers are, which teaches the student more than hard labels would.
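A minimal PyTorch sketch of the standard soft-target loss; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from any particular model:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL against the teacher with cross-entropy on hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale to match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4, vocabulary of 10.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(float(loss))
```

The T-squared factor keeps the soft-target gradients on the same scale as the hard-label term as the temperature changes.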
Used heavily in models like Gemma-2B and Phi-3-mini. The 1-3B size class wouldn’t be useful without distillation.
Combining them
Production edge deployments often stack all three: distill a frontier model down to 1-3B parameters, prune attention heads, and quantize to 4-bit. The result is a 700-800 MB file that runs on a phone at 5-10 tokens/s, with capability that would have been frontier-grade 2-3 years earlier.
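A back-of-the-envelope check that the numbers add up, using the figures quoted above (the exact parameter count and overhead factor are illustrative assumptions):

```python
params_after_distill = 2.0e9      # distill the frontier teacher into a ~2B student
prune_reduction = 0.30            # structured head pruning: 30-40% size reduction (low end)
bits_per_weight = 4               # 4-bit quantization
overhead = 1.05                   # scales, higher-precision embeddings, file metadata

params = params_after_distill * (1 - prune_reduction)
size_mb = params * bits_per_weight / 8 * overhead / 1e6
print(f"~{size_mb:.0f} MB on disk")   # ~735 MB, inside the 700-800 MB range above
```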