AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Sep 9, 2025 · 6 min read

Streaming LLM Responses: UX + Latency Math

Time-to-first-token is the metric users feel. Total latency is the metric your bill cares about. Streaming optimises the first; everything else optimises the second.

Why streaming matters

An LLM that returns a 500-token response in 8 seconds feels broken. The same response streamed at 60 tokens/second, with the first token visible after 600ms, feels fast.

The total time is identical. The user’s perception is not. Streaming converts a long blocking wait into a continuous progress signal. It’s the cheapest UX win in the LLM stack.
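The arithmetic behind that perception gap is easy to make concrete. A minimal sketch (the function name is mine; the 600 ms prefill and 60 tokens/s figures are the ones from the example above):

```python
# Perceived vs total latency for a streamed response.
# Assumed model: TTFT = prefill time; TTLT = prefill + tokens / decode rate.

def stream_latency(prefill_s: float, tokens: int, tokens_per_s: float):
    """Return (time to first token, time to last token) in seconds."""
    ttft = prefill_s
    ttlt = prefill_s + tokens / tokens_per_s
    return ttft, ttlt

ttft, ttlt = stream_latency(prefill_s=0.6, tokens=500, tokens_per_s=60)
# Blocking UX shows nothing until ttlt (~8.9 s); streaming shows text at ttft (0.6 s).
```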

The four latency metrics

TTFT (time to first token): how long before the user sees anything at all.

ITL (inter-token latency): the gap between consecutive tokens once the stream has started.

TPS (tokens per second): decode throughput; roughly the inverse of ITL.

TTLT (time to last token): end-to-end time for the complete response.

Each metric has different optimisation levers. TTFT is dominated by prompt length and prefill optimisations (caching, paged attention). ITL is dominated by model size and decoding speed. TTLT is dominated by output length.
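All four can be derived from one thing: the arrival timestamp of each token. A minimal sketch (function name and timestamp representation are mine, not from any particular SDK):

```python
# Compute the four latency metrics from per-token arrival timestamps.

def latency_metrics(t_request: float, t_tokens: list[float]) -> dict[str, float]:
    """t_request: when the request was sent; t_tokens: arrival time of each token."""
    ttft = t_tokens[0] - t_request                    # time to first token
    ttlt = t_tokens[-1] - t_request                   # time to last token
    gaps = [b - a for a, b in zip(t_tokens, t_tokens[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0      # mean inter-token latency
    tps = len(gaps) / (ttlt - ttft) if gaps else 0.0  # decode tokens per second
    return {"ttft": ttft, "itl": itl, "tps": tps, "ttlt": ttlt}

# The worked example from above: 600 ms prefill, then 60 tokens/s for 500 tokens.
stamps = [0.6 + i / 60 for i in range(500)]
m = latency_metrics(0.0, stamps)
```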

SSE vs WebSocket

Two transport options:

Server-Sent Events (SSE): HTTP one-way push. Built into browsers (EventSource). Simple to implement on any HTTP server. Works through most corporate proxies. The default for streaming text from LLMs.

WebSockets: bidirectional. Useful when the client needs to send mid-stream events (cancellation, hot edits). More complex; more proxy issues; not necessary for one-way text streaming.

Pick SSE unless you specifically need bidirectional messaging. OpenAI, Anthropic, and most LLM APIs use SSE for streaming.
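The one subtlety on the client side is framing: SSE events are separated by a blank line, and network chunks can split an event anywhere, so a parser must buffer across chunk boundaries. A minimal sketch (not any particular SDK's implementation):

```python
# Minimal SSE parser: yields the data payload of each complete event,
# buffering text across arbitrary chunk boundaries.

def sse_events(chunks):
    buf = ""
    for chunk in chunks:
        buf += chunk
        # An SSE event ends with a blank line ("\n\n").
        while "\n\n" in buf:
            raw, buf = buf.split("\n\n", 1)
            # Keep only "data:" lines; drop the prefix and leading whitespace.
            data = [ln[5:].lstrip() for ln in raw.split("\n") if ln.startswith("data:")]
            if data:
                yield "\n".join(data)

# Chunks may split an event anywhere -- the parser reassembles them.
chunks = ["data: Hel", "lo\n\ndata: wor", "ld\n\n"]
print(list(sse_events(chunks)))  # ['Hello', 'world']
```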

Failure modes to plan for

Mid-stream disconnects: the client drops after N tokens. You are billed for the full generation, but the user saw only part of it. Decide up front whether to retry, resume, or surface the partial text.

Buffering proxies: some reverse proxies and load balancers buffer responses, silently turning your stream back into one blocking response. For nginx, disable it with the X-Accel-Buffering: no response header, and verify streaming end to end.

Stalls: the connection stays open but tokens stop arriving. Enforce an inter-token timeout, not just a total-request timeout.

Mid-stream errors: once you have sent a 200 and the first tokens, you cannot change the status code. Errors must be signalled in-band as a final event the client knows to look for.
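One failure mode worth guarding in code is the stall: the connection stays open but tokens stop arriving. A sketch of a consumer-side inter-token timeout, reading tokens on a worker thread and timing out on the consumer side (names and timeout values are illustrative):

```python
# Wrap any token iterator and raise if the gap between tokens exceeds
# stall_timeout, using queue.get(timeout=...) on the consumer side.
import queue
import threading

class StreamStalled(Exception):
    pass

_DONE = object()  # sentinel marking the end of the stream

def guard_stalls(token_iter, stall_timeout: float):
    """Yield tokens from token_iter; raise StreamStalled on a long gap."""
    q: queue.Queue = queue.Queue()

    def pump():
        for tok in token_iter:
            q.put(tok)
        q.put(_DONE)

    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            tok = q.get(timeout=stall_timeout)
        except queue.Empty:
            raise StreamStalled(f"no token for {stall_timeout}s") from None
        if tok is _DONE:
            return
        yield tok
```

Raising inside the generator lets the caller decide what to do with the partial text already shown to the user.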

How to measure right

Log all four metrics per request, not just total time. Use histograms, not means: the slow tail is what users actually notice.
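A synthetic illustration of why the mean misleads (numbers invented; nearest-rank percentile, no interpolation):

```python
# Means hide the slow tail: 95 fast requests and 5 pathological ones.
from statistics import mean

def pctl(samples, p):
    """Nearest-rank percentile -- a simple sketch, not interpolated."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

ttft_ms = [300] * 95 + [4000] * 5
print(f"mean={mean(ttft_ms):.0f}ms p50={pctl(ttft_ms, 50)}ms p95={pctl(ttft_ms, 95)}ms")
# The mean (~485 ms) looks acceptable; p95 exposes the 4-second tail.
```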

The two metrics most worth alerting on:

TTFT at p95: a regression here means users stare at a blank screen before anything appears.

ITL at p95: a regression here means the stream visibly stutters once it has started.

Total latency matters less than these two. A 30-second response with 500ms TTFT and 50ms ITL feels great. A 5-second response with 4-second TTFT feels broken.