Quantisation: Shrinking Models 4x Without Tears
Quantisation reduces model precision from 16 bits per weight to 4, sometimes 2. Done well, the model fits on cheaper hardware with single-digit-percent accuracy loss.
Why quantise
Modern LLMs ship in 16-bit precision (BF16 or FP16), i.e. two bytes per weight, so a 70-billion-parameter model occupies 140 GB of GPU memory for weights alone. That rules out everything short of a multi-GPU server.
Quantisation reduces precision to 8 bits (halving memory) or 4 bits (cutting it 4x). A 4-bit 70B model fits in 35 GB, comfortably on a single 48GB GPU, sometimes on consumer cards.
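The arithmetic is just parameter count times bits per weight; a quick sketch (the helper function is ours, not from any library):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # 140.0  (BF16/FP16)
print(weight_memory_gb(70e9, 8))   # 70.0   (INT8)
print(weight_memory_gb(70e9, 4))   # 35.0   (INT4/NF4)
```

Note this counts weights only; the KV cache and activations add to the real footprint.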
The same gain shows up in inference speed: autoregressive generation is bottlenecked by memory bandwidth, and smaller weights mean less data to move for every generated token.
How it works, intuitively
A 16-bit float has 65,536 possible bit patterns; a 4-bit integer has 16. Quantisation maps the continuous range of weight values onto that small discrete set, with a per-tensor or per-group scale factor that determines how to recover an approximation of the original.
The trick is choosing the mapping. Naive uniform quantisation (split the weight range into 16 equal bins) loses too much accuracy. Modern methods use clever non-uniform quantisation, calibration on activation data, or row-wise scale factors to preserve more signal.
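The uniform case can be sketched in a few lines of plain Python (symmetric absmax scaling over one group; function names are ours, for illustration only):

```python
def quantise_group(weights, bits=4):
    """Symmetric absmax quantisation of one group: map floats to signed
    integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1] plus one scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantise_group(q, scale):
    """Recover an approximation of the original weights."""
    return [qi * scale for qi in q]

group = [0.41, -0.12, 0.07, -0.33, 0.25, 0.02, -0.48, 0.19]
q, scale = quantise_group(group)
recovered = dequantise_group(q, scale)
# Each recovered value is within scale/2 of its original.
```

Real implementations store the integers packed two-per-byte plus one scale per group, which is where the 4x saving comes from.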
INT8 vs INT4 vs NF4
- INT8: 8-bit signed integers. Roughly half the memory of FP16 with negligible accuracy loss (<1%). The safe default. Most production inference uses this.
- INT4: 4-bit integers. Quarter memory of FP16 with 1-3% accuracy loss on most benchmarks. Aggressive but increasingly default for cost-sensitive deployments.
- NF4: 4-bit “NormalFloat”. Its 16 levels sit at the quantiles of a standard normal distribution, matching the roughly Gaussian shape of trained weight distributions. Slightly better accuracy than INT4 at the same size.
- 2-bit and 3-bit: experimental. Significant accuracy loss; only worth it for extreme memory constraints (edge devices).
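The non-uniform idea behind NF4 is just a lookup table of levels clustered where weights actually live. A toy illustration (this 16-entry codebook is hand-made for the example, not the real NF4 table):

```python
# Toy non-uniform codebook: levels packed densely near zero, sparse at
# the tails (the real NF4 table is derived from quantiles of a standard
# normal distribution and looks qualitatively similar).
LEVELS = [-1.0, -0.7, -0.52, -0.39, -0.28, -0.18, -0.09, 0.0,
          0.08, 0.16, 0.25, 0.34, 0.44, 0.56, 0.72, 1.0]

def quantise_nonuniform(w, levels=LEVELS):
    """Snap a normalised weight to the index of the nearest codebook level."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - w))

idx = quantise_nonuniform(0.1)    # LEVELS[idx] == 0.08, the nearest level
```

Because most trained weights are small, spending more of the 16 levels near zero wastes less precision than 16 equal bins would.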
GPTQ vs AWQ vs bitsandbytes
Three popular implementations:
bitsandbytes: zero-config quantisation. Loads any model in 8-bit or 4-bit at runtime. Slight accuracy loss but trivial to use. The default for “just make it fit.”
GPTQ: post-training quantisation that uses a calibration dataset to find optimal per-layer scales. Produces a quantised model file you save. More accurate than bitsandbytes; requires a one-time conversion step.
AWQ: similar to GPTQ but protects activation-sensitive weights. Often the best 4-bit accuracy for inference. Marginally newer and less widely adopted.
For most teams: bitsandbytes for development, GPTQ or AWQ for production.
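For the bitsandbytes path, the development-time setup is a short configuration fragment with Hugging Face transformers (assumes transformers, bitsandbytes, and torch are installed and a GPU is available; the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 load via bitsandbytes; use load_in_8bit=True instead
# (and drop the 4-bit options) for the INT8 variant.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

No conversion step, no saved artefact: the weights are quantised as they load, which is what makes this the “just make it fit” option.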
Quality tradeoffs
Quantisation isn’t free. Empirical patterns from 2024-2025:
- INT8: virtually no accuracy loss across all tasks.
- INT4 / NF4: 1-3% loss on knowledge-heavy benchmarks (MMLU). Closer to 0% on chat-style tasks. Reasoning tasks (math, coding) lose more.
- Smaller base models suffer more proportionally. A 70B at 4-bit is fine; a 7B at 4-bit shows visible degradation.
- Long-context performance degrades faster than short-context as you quantise more aggressively.
If reasoning matters, stick to INT8 or test INT4 carefully against your task. If you’re running chat or simple extraction at high volume, INT4 is usually the right tradeoff.
A no-fuss recipe
- Start with INT8 (bitsandbytes load_in_8bit=True). Measure accuracy on your eval set. If acceptable, ship.
- If you need more memory savings, try INT4 (load_in_4bit=True). Measure again. If <3% loss, ship.
- If INT4 hurts too much, switch to GPTQ or AWQ with calibration on a thousand examples from your domain. This usually recovers most of the lost accuracy.
- If you’re still unhappy, accept the larger model. The cost differential at smaller scales is rarely worth a quality hit users notice.
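The accept/reject gate in the recipe is simple enough to encode directly (the 3% threshold is read here as relative loss, one reasonable interpretation; the function name is ours):

```python
def acceptable(baseline_acc: float, quantised_acc: float,
               max_rel_loss: float = 0.03) -> bool:
    """Ship the quantised model only if relative accuracy loss on your
    eval set stays under the threshold (3% by default, per the recipe)."""
    return (baseline_acc - quantised_acc) / baseline_acc <= max_rel_loss

acceptable(0.70, 0.69)   # ~1.4% relative loss -> ship
acceptable(0.70, 0.65)   # ~7.1% relative loss -> escalate to GPTQ/AWQ
```

The point is to make the decision mechanical: measure on your own eval set at each step rather than trusting benchmark numbers from quantisation papers.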
For LoRA fine-tuning workflows, QLoRA combines a 4-bit base model with full-precision adapters. This is the dominant pattern for fine-tuning under cost constraints in 2025.
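A minimal QLoRA configuration with the peft library looks roughly like this (assumes transformers, bitsandbytes, peft, and torch are installed; the model id is a placeholder and the LoRA hyperparameters are illustrative, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Frozen 4-bit NF4 base model...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# ...with trainable full-precision LoRA adapters on top.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
```

Only the adapter weights receive gradients, so the memory cost is close to 4-bit inference plus a small trainable overlay.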