CPU Inference: When It Actually Makes Sense
GPUs dominate ML inference, but CPUs are surprisingly competitive for small models, low-volume workloads, and edge deployments where GPU economics break down.
Why consider CPU at all
GPUs are typically 5-10x faster on dense matrix multiplication. They are also 5-10x more expensive per hour, and the effective per-request cost can be 50x higher when you can't keep them busy. For low-volume inference, the economics can flip in the CPU's favour.
When CPU wins
- Low volume: a few thousand requests per day. GPU sits idle most of the time.
- Latency-tolerant: a 1-3 second response is acceptable. CPUs can't deliver sub-second latency on large models.
- Small models: models under ~7B parameters, quantised to 4-bit, fit comfortably in CPU memory and run at usable speed.
- Edge / on-prem: hardware that’s already paid for, no GPU available, or air-gapped environments.
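To make "the economics can flip" concrete, here is a back-of-envelope sketch. The hourly prices and latency below are illustrative assumptions, not real cloud quotes; the point is the structure of the calculation, not the exact numbers.

```python
# Back-of-envelope: cost per request for an always-on inference instance.
# All prices and latencies are illustrative assumptions, not real quotes.

GPU_COST_PER_HOUR = 2.00   # hypothetical cloud GPU instance
CPU_COST_PER_HOUR = 0.20   # hypothetical CPU instance, ~10x cheaper
CPU_SECONDS_PER_REQ = 3.0  # CPU latency per request (the latency-tolerant case)

def cost_per_request(hourly_cost, requests_per_day):
    # An always-on instance is paid for whether it is busy or idle,
    # so per-request cost is just the daily cost spread over the requests.
    return hourly_cost * 24 / requests_per_day

# One CPU instance saturates at this many sequential requests per day.
cpu_capacity = int(24 * 3600 / CPU_SECONDS_PER_REQ)

for req_per_day in (1_000, 10_000):
    gpu = cost_per_request(GPU_COST_PER_HOUR, req_per_day)
    cpu = cost_per_request(CPU_COST_PER_HOUR, req_per_day)
    print(f"{req_per_day} req/day: GPU ${gpu:.4f}/req vs CPU ${cpu:.4f}/req")

print(f"one CPU instance saturates at ~{cpu_capacity} req/day")
```

Below saturation, the CPU's per-request cost tracks its 10x-lower hourly rate; the GPU's speed advantage buys nothing while it idles. Once traffic approaches the CPU's capacity ceiling, the comparison has to be redone per unit of throughput, and the GPU starts to win.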
How to deploy
The dominant CPU inference stack:
- llama.cpp: a hand-tuned C++ inference engine for quantised LLMs. Runs on x86 (AVX2/AVX-512) and ARM.
- GGUF format: quantised model files (4-bit, 5-bit, 8-bit) optimised for CPU memory layout.
- Ollama / LM Studio: friendly wrappers over llama.cpp.
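As a minimal example of this stack, the llama-cpp-python bindings expose llama.cpp from Python. The model path below is a placeholder: point it at any local GGUF file, and set the thread count to your physical core count.

```python
# Minimal CPU inference via llama-cpp-python (bindings over llama.cpp).
# The model path is a placeholder for any local 4-bit GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,     # context window
    n_threads=8,    # match your physical core count, not hyperthreads
)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])
```

Ollama wraps the same engine behind a model registry and an HTTP API, so the equivalent there is a one-line `ollama run` invocation; the direct bindings are useful when you want the model inside your own process.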
For server-class CPUs (Xeon Sapphire Rapids, EPYC Genoa), expect roughly 5-15 tokens/s on 7B models and 1-3 tokens/s on 70B models. Consumer CPUs manage about half that.
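Those figures follow from memory bandwidth: token generation is bandwidth-bound, because every generated token requires streaming essentially the full set of weights through the cores. A rough upper bound is bandwidth divided by model size; the bandwidth figures below are assumed sustained rates, not datasheet peaks.

```python
# Rough upper bound on decode speed: generation is memory-bandwidth-bound,
# so tok/s <= effective bandwidth / bytes read per token (~ the model size).
# Bandwidth figures are assumed sustained rates, not peak datasheet numbers.

def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
    return bandwidth_bytes_per_sec / model_bytes

GB = 1e9
model_7b_q4 = 4 * GB      # ~4 GB: 7B parameters at 4-bit
model_70b_q4 = 40 * GB    # ~40 GB: 70B parameters at 4-bit

server_bw = 250 * GB      # assumed sustained bandwidth, server-class CPU
consumer_bw = 60 * GB     # assumed sustained bandwidth, consumer CPU

print(f"7B on server CPU:   <= {max_tokens_per_sec(model_7b_q4, server_bw):.0f} tok/s")
print(f"70B on server CPU:  <= {max_tokens_per_sec(model_70b_q4, server_bw):.1f} tok/s")
print(f"7B on consumer CPU: <= {max_tokens_per_sec(model_7b_q4, consumer_bw):.0f} tok/s")
```

Observed speeds sit well below the bound because real decoding also pays for compute, cache misses, and NUMA effects, but the bound is still useful for ruling configurations out: a 70B model on consumer memory bandwidth cannot be fast, no matter how many cores you add.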
Limits
Above a few thousand requests per day, or above ~7B parameters, GPU economics catch up fast. Don't default to CPU at scale; do default to it for small workloads where a GPU would sit idle.