LLM Caching: Cutting Cost 80%
Most LLM bills could be cut in half with caching alone. Most aren’t, because nobody reaches for the obvious lever. Here are the four cache types that work.
The four cache types
Different parts of an LLM call can be cached at different layers. The four worth knowing:
- Exact-match: identical request returns identical response. App-layer.
- Semantic: similar request returns previous response. App-layer.
- Prompt-prefix: provider caches prefix processing of repeated system prompts. Provider-layer.
- KV cache: model-server caches attention states for repeated context. Server-layer.
Exact-match cache
Hash the full request payload (model, prompt, parameters). Look up; if cached, return the saved response. Otherwise call the model and cache the response.
Hit rates are surprisingly high in production: documentation Q&A, FAQ-like apps, dev/staging traffic, and internal tools all show 30-70%. A single SHA-256 lookup pays back many LLM calls.
What kills exact-match: timestamp injections, user IDs in prompts, random nonces. Strip them before hashing.
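A minimal sketch of the above, using an in-memory dict as a stand-in for Redis. The normalisation regexes are illustrative assumptions: adapt them to whatever volatile fragments your app actually injects.

```python
import hashlib
import json
import re

# In-memory stand-in for Redis; swap for a real client in production.
_cache: dict[str, str] = {}

def _normalise(prompt: str) -> str:
    """Strip volatile fragments that would defeat exact-match hashing.
    These patterns are examples, not an exhaustive list."""
    prompt = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+Z?", "<TS>", prompt)  # timestamps
    prompt = re.sub(r"user_id=\S+", "user_id=<UID>", prompt)            # user IDs
    return prompt

def cache_key(model: str, prompt: str, params: dict) -> str:
    # Hash the full request payload: model, normalised prompt, parameters.
    payload = json.dumps(
        {"model": model, "prompt": _normalise(prompt), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, params: dict, call_llm) -> str:
    key = cache_key(model, prompt, params)
    if key in _cache:
        return _cache[key]          # hit: skip the LLM entirely
    response = call_llm(model, prompt, params)
    _cache[key] = response          # miss: call the model, cache the result
    return response
```

Note that two prompts differing only in an injected timestamp hash to the same key after normalisation, so the second request is a cache hit.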
Semantic cache
Embed the incoming query, search a cache of (embedding, response) pairs by similarity, return the response if a match is close enough.
Higher hit rates than exact-match (you catch “what’s our refund policy?” matching “tell me about refunds”) but riskier: a false-positive cache hit returns the wrong answer to the user.
Mitigations: a high similarity threshold (0.95+), refreshing the cache when source content updates, and monitoring for false positives via sampled human review.
Prompt-prefix cache (provider-side)
OpenAI, Anthropic, and Google all support prompt caching: mark a system prompt or RAG context as cacheable, and the provider caches the prefill computation. Subsequent calls with the same prefix skip the prefill phase.
Savings: 50-90% on input-token cost for the cached portion, plus a large cut in time to first token since the prefill work is skipped. For RAG systems with large stable contexts, this is the biggest cache win available.
Activate it on every system prompt and any retrieved context that’s reused across requests. Anthropic and Google require marking the prefix explicitly; OpenAI applies prefix caching automatically for sufficiently long prompts. Cache writes may cost slightly more than regular input tokens, but reads are heavily discounted.
KV cache
Inside the model, attention computes Key and Value matrices for every token. For autoregressive generation, the K/V entries for previous tokens don’t change as you add new tokens, so the model server caches them.
This is automatic in vLLM, TGI, and provider APIs. You don’t configure it; you benefit from it. It’s the reason generating the 500th token isn’t slower than generating the 5th.
What’s relevant: KV cache memory grows linearly with context length and concurrent users. Long contexts are expensive partly because of KV-cache memory pressure. Specialised techniques (paged attention, sliding window) keep this under control at scale.
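The memory pressure is easy to quantify with the standard back-of-envelope formula: two matrices (K and V) per layer, per head, per token. The model dimensions below describe a generic 7B-class model and are assumptions for illustration.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, n_users: int,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache footprint: 2 (K and V) x layers x KV heads x head dim
    x element size, per token, per concurrent sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens * n_users

# A 7B-class model: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
# -> 512 KiB per token, 2 GiB per user at a 4096-token context.
per_user = kv_cache_bytes(32, 32, 128, n_tokens=4096, n_users=1)
ten_users_gib = kv_cache_bytes(32, 32, 128, n_tokens=4096, n_users=10) / 2**30
```

At ten concurrent users that is 20 GiB of GPU memory for cache alone, which is exactly the pressure paged attention and sliding-window techniques exist to relieve.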
Production layout
A reasonable shape:
- App layer checks exact-match cache (Redis). Hit: return.
- App layer checks semantic cache. Hit (above threshold): return.
- Call the LLM with prompt-prefix cache enabled.
- Provider returns; cache the response in semantic and exact-match stores.
This layered design means you pay full price only when the request is genuinely new and dissimilar from prior ones. For most apps, that’s 20-50% of traffic; the remaining 50-80% hits a cache layer and costs a fraction of a full call, which is where the 80%+ savings come from.
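The layered flow above reduces to a short dispatch function. The `exact_cache`, `semantic_cache`, and `call_llm` dependencies are injected stand-ins for Redis, the semantic store, and the provider client respectively.

```python
def handle_request(prompt: str, exact_cache: dict,
                   semantic_cache, call_llm) -> str:
    """Layered cache lookup: cheapest and safest check first,
    LLM call only on a true miss."""
    key = prompt  # in practice: a normalised SHA-256 of the full payload
    # 1. Exact match: O(1) lookup, zero risk of a wrong answer.
    if key in exact_cache:
        return exact_cache[key]
    # 2. Semantic match: catches paraphrases, gated by a high threshold.
    hit = semantic_cache.get(prompt)
    if hit is not None:
        return hit
    # 3. True miss: call the provider (prompt-prefix cache applies there).
    response = call_llm(prompt)
    # 4. Populate both stores so future requests can short-circuit.
    exact_cache[key] = response
    semantic_cache.put(prompt, response)
    return response
```

Ordering matters: the exact-match check runs first because it is both cheaper and safer than the semantic lookup.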