CPU Inference: When It Actually Makes Sense
GPUs dominate ML inference. CPUs are surprisingly competitive for small models, low-volume workloads, and edge deployments where GPU economics break down.
Why consider CPU
GPUs aren't always available. CPU instances are everywhere. CPU inference is slower per dollar at most workloads; for some workloads (low QPS, latency-tolerant, small models), CPU economics actually win. The narrative "you must use GPUs" overstates the GPU advantage; many workloads run fine on CPU and avoid GPU scheduling complexity.
The availability advantage. CPU instances are abundant in every cloud and every region. GPU instances queue; the latest models queue longer. For teams that need predictable capacity, CPU removes the supply-chain risk.
The integration advantage. Most existing infrastructure is CPU-based. Adding GPU requires new hardware procurement, new scheduling, new monitoring, new everything. CPU inference fits into existing stacks without infrastructure overhaul.
The cost advantage at low QPS. GPU instances are sized for high throughput. At low queries-per-second, the GPU is mostly idle; you pay for capacity you're not using. CPU inference scales down naturally; you pay only for what you use.
The simplicity advantage. CPU inference has no driver compatibility issues, no CUDA versions, no GPU memory management. The operational complexity drops substantially. For small teams, the simplicity is worth a lot.
When it wins
Several patterns where CPU is the right call:
- Small models (≤7B parameters with quantisation; ≤13B with newer instruction-following models that don't need full precision).
- Low QPS, fewer than ~10 requests per second per replica. GPU is sized for higher throughput; CPU economics dominate at low rates.
- Latency-tolerant, first-token latency under 1 second is acceptable. CPU inference is slower per token; if 500ms-2s is fine, CPU works.
- Edge/on-device, phones, laptops, embedded devices. GPU may not exist; CPU is the only option.
The small-model case in detail. Quantised 7B models fit in 4-8GB and run on commodity laptops. With GGUF quantisation (Q4 or Q5), inference is 5-30 tokens/second on modern CPUs. For chat applications with mostly short outputs, the throughput is acceptable.
The low-QPS case in detail. A GPU instance costs $1-3/hour. CPU instance costs $0.05-0.50/hour. At low QPS, the GPU's capacity is wasted; the CPU's lower throughput is sufficient. The crossover happens around 5-15 QPS depending on model size.
The latency-tolerant case in detail. CPU first-token latency is 200-800ms; per-token throughput is 10-30 tokens/second. For batch processing, async work, low-stakes UI ("loading..." spinner is OK), this is plenty.
The edge case in detail. Mobile and laptop deployment is CPU (or sometimes integrated GPU; not data-center GPU). Models running locally on user devices are CPU by definition; the question is which models can fit and run usefully.
How to deploy
The toolchain matters. llama.cpp (and its many forks) is the standard for CPU inference of quantised LLMs. ggml/GGUF format. AVX2/AVX-512 instructions; ARM NEON for ARM cores. For batched server-side: vLLM with CPU backend, or specialised servers like LM Studio's headless mode.
The quantisation choice. Q4 (4-bit weights) is the default for CPU; smallest model size and fastest inference. Q5 and Q6 trade speed for quality. F16 (no quantisation) is roughly 4x slower than Q4 on CPU and rarely worth the quality trade-off for production CPU inference.
The instruction-set choice. Modern CPUs (Intel Xeon, AMD EPYC, AWS Graviton) support AVX2 and AVX-512. AVX-512 is roughly 2x faster than AVX2 for matmul. Verify your target CPUs support AVX-512; if not, AVX2 is the fallback.
The serving stack. Single-request serving: llama.cpp's HTTP server is sufficient. Batched serving for higher QPS: build on vLLM CPU backend or use specialised tools. The serving stack matters most at high QPS; below ~5 QPS, naive serving is fine.
The monitoring. CPU inference latency is sensitive to neighbour workloads (memory bandwidth contention). Track p50/p95/p99 latency; alert on tail-latency degradation. CPU inference is more vulnerable to noisy-neighbour effects than GPU.
Limits
Don't try to run 70B+ models on CPU for production. The throughput is too low (1-5 tokens/second on commodity CPUs); the cost-per-token doesn't compete with GPU; latency is unacceptable for interactive use. CPU inference is a tool for the right size of model; using it for the wrong size produces bad outcomes.
The hard ceiling. Models above ~13B parameters generally don't run usefully on CPU. The throughput is too low for production. Specialised hardware (GPU, dedicated inference chips) is required for >13B at scale.
The latency ceiling. CPU first-token latency is hard to push below 200ms. For UX requiring sub-100ms response, CPU isn't viable. Move to GPU or accept the latency cost.
The throughput ceiling. CPU inference at production scale is rarely above 50-100 tokens/second total per server. For workloads needing higher throughput per server, CPU isn't economic; GPU's higher throughput justifies its higher cost.
The future trajectory. CPU inference performance is improving (instruction set extensions, memory bandwidth improvements, specialised inference cores in CPUs). The gap with GPU narrows; doesn't close. CPU inference's role expands but doesn't displace GPU for high-throughput workloads.
Common antipatterns
CPU inference for high-QPS production traffic. Throughput per dollar loses to GPU at scale. Use CPU for low-QPS, batch, or edge.
Running a 70B model on CPU "to save money". The throughput is so low that per-token cost is higher than GPU. Match model size to inference target.
Skipping quantisation. F16 CPU inference is wasted compute. Quantise to Q4 or Q5 for production.
Not measuring tail latency. CPU latency is bursty under load. Without p99 monitoring, you'll miss the cases that hurt users.
What to do this week
Three moves. (1) For any low-QPS LLM workload, model whether CPU inference is cheaper. The threshold is around 5-15 QPS for 7B-13B models; below that, CPU often wins. (2) If you have edge deployment plans (laptop, mobile, embedded), test the largest quantised model that fits and runs at acceptable speed. The capability ceiling for edge is what determines your edge product's scope. (3) Add CPU inference as a fallback for GPU outages. When GPU capacity becomes constrained, CPU inference at degraded performance is much better than 503 errors.