Streaming LLM Responses: UX + Latency Math
Time-to-first-token is the metric users feel. Total latency is the metric your bill cares about. Streaming optimises the first; everything else optimises the second.
Why streaming matters
An LLM that returns a 500-token response in 8 seconds feels broken. The same response streamed at 60 tokens/second, with the first token visible after 600ms, feels fast.
The total time is nearly identical. The user’s perception is not. Streaming converts a long blocking wait into a continuous progress signal. It’s the cheapest UX win in the LLM stack.
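The arithmetic behind that comparison is worth making explicit. A minimal sketch using the numbers from the example above (the helper function is illustrative, not from any API):

```python
def total_time(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Wall-clock total: time to first token, plus one inter-token gap
    per remaining token."""
    return ttft_s + (output_tokens - 1) * itl_s

# 500 tokens at 60 tok/s (ITL ~16.7 ms) with a 600 ms TTFT:
streamed = total_time(0.6, 1 / 60, 500)  # ~8.9 s total, visible at 0.6 s
blocking = 8.0                           # same answer, nothing visible until 8 s
```

The streamed version actually finishes slightly later, yet feels dramatically faster, because something is on screen after 600 ms.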
The four latency metrics
- Time to first byte (TTFB): from request to first byte returned. Network + server-side prefill.
- Time to first token (TTFT): from request to first model-emitted token. The user-perceived “is it working” signal.
- Inter-token latency (ITL): time between consecutive tokens during streaming. Determines perceived “reading speed.”
- Total time (TTLT): from request to last token. The wall-clock cost.
Optimisation tools differ per metric. TTFT is dominated by prompt length and prefill speed (prompt caching, chunked prefill). ITL is dominated by model size and per-token decode speed. TTLT is dominated by output length.
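All four metrics fall out of a handful of client-side timestamps. A hedged sketch in Python (class and field names are illustrative, not a standard API):

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    request_sent: float   # monotonic timestamps, in seconds
    first_byte: float
    token_times: list     # arrival time of each emitted token

    @property
    def ttfb(self) -> float:
        return self.first_byte - self.request_sent

    @property
    def ttft(self) -> float:
        return self.token_times[0] - self.request_sent

    @property
    def itl(self) -> float:
        # Mean gap between consecutive tokens during the stream.
        gaps = [b - a for a, b in zip(self.token_times, self.token_times[1:])]
        return sum(gaps) / len(gaps)

    @property
    def ttlt(self) -> float:
        return self.token_times[-1] - self.request_sent
```

Record `token_times` as tokens arrive and the rest is arithmetic; logging all four per request costs almost nothing.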
SSE vs WebSocket
Two transport options:
Server-Sent Events (SSE): HTTP one-way push. Built into browsers (EventSource). Simple to implement on any HTTP server. Works through most corporate proxies. The default for streaming text from LLMs.
WebSockets: bidirectional. Useful when the client needs to send mid-stream events (cancellation, hot edits). More complex; more proxy issues; not necessary for one-way text streaming.
Pick SSE unless you specifically need bidirectional. OpenAI, Anthropic, and most LLM APIs use SSE for streaming.
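The SSE wire format is line-based: `data:` fields, events terminated by a blank line. A minimal parser for the subset LLM APIs actually use (a sketch, not a full implementation of the spec):

```python
def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of
    decoded lines. Handles only `data:` fields; multi-line data is
    joined with newlines, per the SSE spec."""
    buf = []
    for line in lines:
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):   # the spec strips one leading space
                value = value[1:]
            buf.append(value)
        elif line == "" and buf:        # blank line ends the event
            yield "\n".join(buf)
            buf = []
```

In a browser, `EventSource` does this for you; server-side clients often reimplement exactly this loop over a chunked HTTP response.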
Failure modes to plan for
- Connection drops mid-stream. The user sees a half-answer. Implement client-side resume or graceful retry of the rest of the response.
- Slow consumer backpressure. The client’s buffer fills and the server stalls. Buffer on the server side with a generous write timeout (>10s; LLMs occasionally pause mid-stream when reasoning).
- Truncation surprise. Model hits max_tokens mid-sentence. Show a clean indicator to the user (“...”) rather than a hard cut.
- JSON-mode streaming. A partially streamed JSON payload is not valid JSON. Either don’t stream JSON, or render the stream as an “assembling” preview and parse only on a final commit.
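The assemble-then-commit pattern for streamed JSON can be sketched like this (function name is ours; the rendering step is a placeholder comment):

```python
import json

def assemble_json_stream(chunks):
    """Accumulate streamed text chunks; parse only once the stream ends.

    Intermediate state is shown to the user as raw in-progress text,
    never as a parsed object. The final parse is the commit: it either
    yields valid JSON or raises, so a truncated stream fails loudly."""
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        # render "".join(buf) as an "assembling" preview here
    return json.loads("".join(buf))
```

The important property is that nothing downstream ever sees a half-parsed object; truncation becomes an explicit error instead of silently malformed data.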
How to measure right
Log all four metrics per request, not just total time. Histograms, not means: the slow tail is what users complain about.
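Percentiles over logged samples need no dependencies. A minimal nearest-rank sketch (sample values are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value at or above the rank
    covering p percent of samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

ttft_ms = [420, 510, 480, 2300, 450, 600, 495, 470, 530, 1800]
p95 = percentile(ttft_ms, 95)  # picks up the slow tail a mean would hide
```

The mean of that sample is under 800 ms; the P95 is well over 2 seconds. That gap is exactly why means are the wrong alerting signal.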
The two metrics most worth alerting on:
- P95 TTFT > 2 seconds: users notice. Investigate prompt size, prefill, queue.
- P95 ITL > 100ms: users feel choppy reading. Investigate model size, batching, server load.
Total latency matters less than these two. A 30-second response with 500ms TTFT and 50ms ITL feels great. A 5-second response with 4-second TTFT feels broken.