AI & ML Practical By Samson Tanimawo, PhD Published Jul 25, 2026 4 min read

Batched Inference vs Streaming: Cost vs Latency

Batching is 5-10x cheaper. Streaming is 5-10x faster. The use cases where each wins, with concrete cost and latency numbers.

When batching wins

Asynchronous workloads. Bulk classification, bulk summarisation, periodic embedding updates. Nobody is waiting; latency tolerance is high.

5-10x cheaper than streaming. The model provider amortises GPU time across the batch.

Implementation: client.messages.batches in the Anthropic SDK or equivalent OpenAI batch endpoint. The async pattern is well-supported.

Synchronous workloads. Chat, search, real-time triage. Latency is part of the product; long waits feel broken.

Use streaming response (server-sent events). The first token arrives in milliseconds; the full response in seconds.

Implementation: standard streaming API. Most SDKs support it natively.

Streaming a 4k-token request to a frontier model: $0.05, p95 latency ~3 seconds.

Batching the same request: $0.025, latency from minutes to 24 hours depending on batch size.

Pick by user wait tolerance. If users wait, stream. If users do not wait, batch.

Pre-compute via batch overnight; serve via streaming during the day. The model output is cached; users see streaming latency on warmed cache.

Cache hits at the streaming layer mean the batch absorbed the cost; the user pays the streaming latency only.

This pattern is what most production AI products eventually arrive at.