Batched Inference vs Streaming: Cost vs Latency

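Batching is cheaper, typically around half the price. Streaming is faster, returning a response in seconds rather than minutes or hours. This piece covers the use cases where each wins, with concrete cost and latency numbers.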

When batching wins

Batching wins when the workload is asynchronous: bulk classification, bulk summarisation, periodic embedding updates. Nobody is waiting, so latency tolerance is high, and the model provider can amortise GPU time across the batch and schedule it on spare capacity. That saving is passed on as a lower price, typically about half the cost of the same request made synchronously.
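As a rough illustration of the bulk-classification case, the sketch below splits a list of documents into batch files in JSONL form, one request per line. The field names (`custom_id`, `instruction`, `input`), the `doc-` prefix, and the `max_per_batch` limit are assumptions made for this example, not any particular provider's schema; check the provider's batch documentation for the real request shape and size limits.

```python
import json
from typing import List


def build_batches(texts: List[str], instruction: str, max_per_batch: int = 1000) -> List[str]:
    """Split texts into JSONL payloads of at most max_per_batch requests each.

    Every request gets a globally unique custom_id so that results, which
    may come back in any order, can be matched to their inputs.
    """
    batches = []
    for start in range(0, len(texts), max_per_batch):
        chunk = texts[start:start + max_per_batch]
        lines = [
            json.dumps({
                "custom_id": f"doc-{start + offset}",  # hypothetical id scheme
                "instruction": instruction,            # hypothetical field name
                "input": text,                         # hypothetical field name
            })
            for offset, text in enumerate(chunk)
        ]
        batches.append("\n".join(lines))
    return batches


if __name__ == "__main__":
    payloads = build_batches(["great product", "arrived broken", "ok"],
                             "Classify the sentiment.", max_per_batch=2)
    for payload in payloads:
        print(payload)
        print("---")
```

Keeping the ids stable across chunks is the part that matters in practice: the job is submitted, forgotten, and reconciled hours later.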

When streaming wins

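Streaming wins when latency is part of the product: chat, search, real-time triage. In these settings a long wait feels broken to the user. Streaming does not make generation itself faster; it sends tokens as they are produced, usually over server-sent events, so the first token typically arrives within a few hundred milliseconds and the full response within seconds. That is what a synchronous experience requires.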
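To make the server-sent events mechanics concrete, here is a minimal sketch of a client-side parser that reads `data:` lines and records time to first token. It handles only the `data` field of the SSE format and ignores `event:`, `id:`, `retry:`, and comment lines. The `[DONE]` end marker is a convention some providers use and is an assumption here, as is treating each payload as plain text rather than JSON.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event.

    Events end at a blank line; several data: lines in one event are
    joined with newlines. An unterminated trailing event is dropped.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            buffer.append(value)


def consume(lines: Iterable[str], done_marker: str = "[DONE]") -> Tuple[str, Optional[float]]:
    """Collect streamed text and the seconds elapsed before the first chunk."""
    start = time.monotonic()
    first_chunk_after = None
    chunks = []
    for payload in iter_sse_data(lines):
        if payload == done_marker:  # assumed end-of-stream convention
            break
        if first_chunk_after is None:
            first_chunk_after = time.monotonic() - start
        chunks.append(payload)
    return "".join(chunks), first_chunk_after


if __name__ == "__main__":
    fake_stream = ["data: Hel\n", "\n", "data: lo\n", "\n", "data: [DONE]\n", "\n"]
    text, ttft = consume(fake_stream)
    print(text, ttft)
```

In a real client `lines` would be the line iterator of an HTTP response body; the point is that the UI can render each chunk as it arrives instead of waiting for the whole response.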

Cost and latency numbers

The numbers frame the choice. As an illustration, streaming a 4k-token request to a frontier model lands around $0.05 with p95 latency of about 3 seconds; batching the same request costs roughly $0.025, half the price, with turnaround anywhere from minutes to 24 hours depending on batch size. The price gap is known up front, so the user’s wait tolerance is the deciding factor.
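The arithmetic scales in a straightforward way. The sketch below takes the two per-request figures quoted above and estimates a monthly bill when some fraction of traffic can tolerate batch turnaround. The request volume and the 60% batchable share are made-up inputs for illustration, not measurements.

```python
STREAM_COST_PER_REQUEST = 0.05   # USD, figure quoted in the text
BATCH_COST_PER_REQUEST = 0.025   # USD, figure quoted in the text


def monthly_cost(requests_per_day: float, batch_fraction: float, days: int = 30) -> float:
    """Estimated monthly spend when batch_fraction of requests go through batch."""
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    batched = requests_per_day * batch_fraction * BATCH_COST_PER_REQUEST
    streamed = requests_per_day * (1.0 - batch_fraction) * STREAM_COST_PER_REQUEST
    return (batched + streamed) * days


if __name__ == "__main__":
    volume = 10_000  # hypothetical requests per day
    all_streaming = monthly_cost(volume, 0.0)
    mostly_batched = monthly_cost(volume, 0.6)  # hypothetical: 60% can wait
    print(f"all streaming: ${all_streaming:,.2f}")
    print(f"60% batched:   ${mostly_batched:,.2f}")
    print(f"saving:        ${all_streaming - mostly_batched:,.2f}")
```

Because the discount is a fixed fraction, the saving is linear in the share of traffic that can wait; the hard part is deciding which requests those are.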

Hybrid patterns

Hybrid patterns capture both the savings and the latency. Pre-compute answers via batch overnight, then serve during the day from a cache in front of the streaming endpoint. On a cache hit the batch job has already absorbed the generation cost and the user waits only for a lookup; on a miss the request falls through to a live streaming call at the full price. Many production AI products eventually arrive at some version of this.
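A minimal sketch of that serving path, assuming the overnight batch job has filled an in-memory dictionary keyed by a normalised query string. The names `normalise`, `serve`, and `stream_fn` are hypothetical, and a production system would likely use a shared cache with expiry and a more careful key than lowercased text.

```python
from typing import Callable, Dict, Iterator


def normalise(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


def serve(query: str,
          cache: Dict[str, str],
          stream_fn: Callable[[str], Iterator[str]]) -> Iterator[str]:
    """Yield response chunks, preferring the batch-precomputed cache.

    cache     -- answers produced by the overnight batch job (assumed)
    stream_fn -- any callable that streams a live answer chunk by chunk
    """
    key = normalise(query)
    if key in cache:
        # Cache hit: generation was already paid for at the batch price.
        yield cache[key]
        return
    # Cache miss: fall through to the live streaming call.
    yield from stream_fn(query)


if __name__ == "__main__":
    precomputed = {"what is batching?": "Batching groups requests to cut cost."}

    def fake_stream(q: str) -> Iterator[str]:
        # Stand-in for a real streaming client.
        yield "live "
        yield "answer"

    print("".join(serve("What is  batching?", precomputed, fake_stream)))
    print("".join(serve("Something new", precomputed, fake_stream)))
```

Both branches return an iterator of chunks, so the caller renders a cached answer and a streamed one through the same code path.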