By Samson Tanimawo, PhD. Published Mar 24, 2026.

CPU Inference: When It Actually Makes Sense

GPUs dominate ML inference. CPUs are surprisingly competitive for small models, low-volume workloads, and edge deployments where GPU economics break down.

Why consider CPU at all

GPUs are 5-10x faster on dense matrix multiplication. They’re also 5-10x more expensive per hour, and often 50x more expensive per request when you can’t saturate them. For low-volume inference, the economics can flip in favour of CPU.
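The flip is easiest to see as cost per request rather than cost per hour. The sketch below is illustrative only: the hourly prices and the saturated throughput figures are assumptions chosen to sit inside the 5-10x ranges above, not measurements.

```python
def cost_per_request(hourly_price: float, requests_per_hour: float) -> float:
    """Effective cost of one request when the instance is billed for the full hour."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return hourly_price / requests_per_hour

# Assumed prices and capacities, for illustration only.
CPU_HOURLY = 0.40      # USD per hour (hypothetical)
GPU_HOURLY = 3.00      # USD per hour (hypothetical, 7.5x the CPU price)
CPU_CAPACITY = 360     # requests per hour when saturated (hypothetical)
GPU_CAPACITY = 3600    # requests per hour when saturated (hypothetical, 10x the CPU)

# Low volume: 3,000 requests per day is 125 per hour, well below either capacity.
low_volume = 3000 / 24
cpu_cost = cost_per_request(CPU_HOURLY, low_volume)
gpu_cost = cost_per_request(GPU_HOURLY, low_volume)

# Saturated: each machine serves as many requests as it can.
cpu_saturated = cost_per_request(CPU_HOURLY, CPU_CAPACITY)
gpu_saturated = cost_per_request(GPU_HOURLY, GPU_CAPACITY)

print(f"low volume  CPU ${cpu_cost:.4f}  GPU ${gpu_cost:.4f}")
print(f"saturated   CPU ${cpu_saturated:.5f}  GPU ${gpu_saturated:.5f}")
```

Under these assumed numbers the GPU is cheaper per request once saturated, but several times more expensive per request at low volume, because the idle hours are billed either way.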

When CPU wins

Low volume: a few thousand requests per day. The GPU sits idle most of the time.
Latency-tolerant: a 1-3 second response is acceptable. CPU can’t deliver sub-second responses on large models.
Small models: models up to about 7B parameters, quantised to 4-bit, fit comfortably in CPU memory and run at usable speed.
Edge / on-prem: hardware that’s already paid for, no GPU available, or air-gapped environments.
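To check the "fits comfortably in CPU memory" point, a back-of-the-envelope estimate of weight size is parameters times bits per weight, divided by eight. This is a rough sketch: it counts the weights only and ignores quantisation metadata, the KV cache, and runtime overhead, so real usage will be somewhat higher.

```python
def quantised_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Compare a 7B model at common precisions.
for bits in (4, 8, 16):
    print(f"7B at {bits}-bit: ~{quantised_weight_gb(7e9, bits):.1f} GB")
```

At 4-bit, a 7B model needs roughly 3.5 GB for weights, against roughly 14 GB at 16-bit.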

How to deploy

The dominant CPU inference stack:

llama.cpp: hand-tuned C++ inference for quantised LLMs. Runs on x86 (AVX2/AVX-512) and ARM.
GGUF format: quantised model files (4-bit, 5-bit, 8-bit) optimised for CPU memory layout.
Ollama / LM Studio: friendly wrappers over llama.cpp.
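As one way to call this stack from application code, the sketch below sends a prompt to a locally running Ollama server over its HTTP API, using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that the model has already been pulled. The helper names and the timeout value are illustrative choices, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    The generous default timeout is an assumption, chosen because CPU
    generation can take many seconds for longer outputs.
    """
    req = build_generate_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["response"]
```

With the server running, a call looks like generate("your-model-name", "Summarise this log line: ..."), where the model name is a placeholder for whatever model is installed locally.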

On server-class CPUs (Xeon Sapphire Rapids, EPYC Genoa), expect 5-15 tokens/second on 7B models and 1-3 tokens/second on 70B models. On consumer CPUs, expect roughly half that.
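Those throughput figures translate directly into how much output fits inside a latency budget. The sketch below is a simplification that counts generation time only and ignores prompt processing. The 3-second budget is taken from the upper end of the latency-tolerant range mentioned earlier.

```python
def tokens_within_budget(tokens_per_second: float, budget_seconds: float) -> int:
    """Number of output tokens that can be generated inside the latency budget."""
    return int(tokens_per_second * budget_seconds)

# 7B model on a server-class CPU: 5 to 15 tokens/second.
for tps in (5, 15):
    print(f"{tps} tok/s -> {tokens_within_budget(tps, 3.0)} tokens in 3 s")
```

At these rates a 3-second budget covers roughly 15 to 45 output tokens, which suits short outputs such as labels, extractions, or brief answers. Longer generations would need streaming or a looser budget.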

Limits

Above a few thousand requests per day, or above 7B parameters, GPU economics catch up fast. Don’t default to CPU at scale; do default to it for small workloads where the GPU would idle.
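A rough capacity check shows where the "few thousand requests per day" ceiling comes from. The sketch assumes a single CPU instance serving one request at a time with no idle gaps, and an assumed average of 100 output tokens per request. Both are illustrative simplifications.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def max_requests_per_day(tokens_per_second: float, tokens_per_request: float) -> int:
    """Upper bound on daily requests for one instance serving sequentially."""
    return int(SECONDS_PER_DAY * tokens_per_second / tokens_per_request)

# 7B model on a server-class CPU, 100 output tokens per request (assumed).
for tps in (5, 10, 15):
    print(f"{tps} tok/s -> at most {max_requests_per_day(tps, 100)} requests/day")
```

Even at a perfectly even arrival rate, the bound under these assumptions lands between about 4,000 and 13,000 requests per day. Real traffic is bursty, so the practical limit is lower.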