Batched Inference vs Streaming: Cost vs Latency
Batching is 5-10x cheaper. Streaming is 5-10x faster. The use cases where each wins, with concrete cost and latency numbers.
When batching wins
Batching wins when the workload is asynchronous. Bulk classification, bulk summarisation, periodic embedding updates: nobody is waiting, latency tolerance is high, and the model provider amortises GPU time across the batch for 5-10x cost savings versus streaming.
- Asynchronous workloads. Bulk classification, summarisation, periodic embedding updates; latency tolerance is high.
- 5-10x cheaper than streaming. The model provider amortises GPU time across the batch; the savings are large at volume.
- Implementation.
client.messages.batchesin the Anthropic SDK or equivalent OpenAI batch endpoint; the async pattern is well-supported. - Per-batch sizing. Larger batches amortise better but extend completion time; tune per workload.
When streaming wins
Streaming wins when latency is part of the product. Chat, search, real-time triage; long waits feel broken to the user. Server-sent events deliver the first token in milliseconds and the full response in seconds, which is what the synchronous experience requires.
- Synchronous workloads. Chat, search, real-time triage; latency is part of the product.
- Server-sent events. First token arrives in milliseconds; the full response in seconds.
- Implementation. Standard streaming API; most SDKs support it natively.
- Per-token UX. Token-by-token rendering keeps the user engaged; the perceived latency is lower than wall-clock latency.
Cost and latency numbers
The numbers drive the choice. Streaming a 4k-token request to a frontier model lands around $0.05 with p95 latency of about 3 seconds; batching the same request costs roughly $0.025 with latency from minutes to 24 hours depending on batch size. The user’s wait tolerance is the deciding factor.
- Streaming cost-latency. 4k-token request to a frontier model: $0.05, p95 latency ~3 seconds.
- Batching cost-latency. Same request via batch: $0.025, latency from minutes to 24 hours depending on batch size.
- Decision rule. If users wait, stream; if users do not wait, batch.
- Per-workload TCO. Mixed workloads benchmark both modes against the actual access pattern; supports cost-conscious architecture.
Hybrid patterns
Hybrid patterns capture both savings and latency. Pre-compute via batch overnight, serve via streaming during the day; cache hits at the streaming layer mean the batch absorbed the cost and the user pays only the streaming latency. This is what most production AI products eventually arrive at.
- Batch overnight, stream by day. Pre-compute via batch; serve via streaming on warmed cache; the cost-latency frontier shifts in the team’s favour.
- Cache hits absorb cost. The batch paid the GPU time; the user sees streaming latency without the streaming cost.
- Production convergence. Most production AI products arrive at this pattern; the hybrid is the steady-state.
- Per-cache-key hit-rate target. Pre-compute coverage tracked as a hit-rate metric; supports continuous cost reduction.