Batched Inference vs Streaming: Cost vs Latency
Batching cuts cost roughly in half. Streaming turns hours of waiting into seconds. The use cases where each wins, with concrete cost and latency numbers.
When batching wins
Asynchronous workloads. Bulk classification, bulk summarisation, periodic embedding updates. Nobody is waiting; latency tolerance is high.
Roughly half the per-token cost of streaming; both Anthropic and OpenAI discount batch requests by about 50%. The model provider amortises GPU time across the batch.
Implementation: client.messages.batches in the Anthropic SDK, or the equivalent OpenAI Batch API. The submit, poll, collect pattern is well supported.
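A minimal sketch of that pattern using the Anthropic Python SDK's Message Batches API, assuming a hypothetical bulk-classification job; the model name, poll interval, and to_classify inputs are placeholders.

```python
# Submit a batch, poll until it finishes, collect results keyed by custom_id.
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

to_classify = ["ticket text 1", "ticket text 2", "ticket text 3"]  # hypothetical inputs

# One submission covers every request; nothing blocks on a user.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-5",  # assumed model name
                "max_tokens": 64,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        for i, text in enumerate(to_classify)
    ]
)

# Poll until processing ends, then iterate the result stream.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

results = {}
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        results[entry.custom_id] = entry.result.message.content[0].text
```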
When streaming wins
Synchronous workloads. Chat, search, real-time triage. Latency is part of the product; long waits feel broken.
Use a streaming response (server-sent events). The first token typically arrives within a few hundred milliseconds; the full response completes in seconds.
Implementation: standard streaming API. Most SDKs support it natively.
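A minimal streaming sketch with the same Anthropic Python SDK; the model name and prompt are placeholders. The SDK wraps the server-sent-events stream in a context manager that yields text deltas as they arrive.

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",  # assumed model name
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this support ticket: ..."}],
) as stream:
    for text in stream.text_stream:       # text deltas, token by token
        print(text, end="", flush=True)   # first tokens render while the rest generates
    final_message = stream.get_final_message()  # full accumulated message
```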
Cost and latency numbers
Streaming a 4k-token request to a frontier model: $0.05, p95 latency ~3 seconds.
Batching the same request: $0.025 (the standard 50% batch discount), with turnaround anywhere from minutes up to the 24-hour completion window, depending on batch size and provider load.
Pick by wait tolerance. If a user is actively waiting on the response, stream. If nobody is waiting, batch.
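As a sketch of that routing rule (the function name, Mode enum, and the 60-second threshold are assumptions for illustration, not from any SDK):

```python
from enum import Enum

class Mode(Enum):
    STREAM = "stream"
    BATCH = "batch"

def choose_mode(user_is_waiting: bool, max_acceptable_wait_s: float) -> Mode:
    """Stream when a person is blocked on the answer; batch everything else.
    A wait budget measured in seconds also forces streaming."""
    if user_is_waiting or max_acceptable_wait_s < 60:
        return Mode.STREAM
    return Mode.BATCH

assert choose_mode(user_is_waiting=True, max_acceptable_wait_s=5) is Mode.STREAM
assert choose_mode(user_is_waiting=False, max_acceptable_wait_s=4 * 3600) is Mode.BATCH
```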
Hybrid patterns
Pre-compute via batch overnight; serve from cache during the day, streaming only on a miss. The output is already cached, so a warm cache delivers streaming-or-better latency.
A cache hit means the batch absorbed the cost and the user barely waits; a miss falls back to normal streaming cost and latency.
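A sketch of that serving path under the same assumptions as above; the in-memory dict stands in for whatever store the nightly batch job writes into, and the model name is a placeholder.

```python
import anthropic

client = anthropic.Anthropic()
cache: dict[str, str] = {}  # stand-in for Redis/memcached, keyed by prompt

def serve(prompt: str) -> str:
    """Return the batch-computed answer when cached; otherwise stream live."""
    if prompt in cache:
        return cache[prompt]          # batch paid the cost; the user barely waits

    # Cache miss: fall back to a live streaming call.
    chunks: list[str] = []
    with client.messages.stream(
        model="claude-sonnet-4-5",  # assumed model name
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            chunks.append(text)       # in a real app, also flush each delta to the client
    answer = "".join(chunks)
    cache[prompt] = answer            # warm the cache for the next request
    return answer
```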
This pattern is what most production AI products eventually arrive at.