Cost Engineering for LLM Apps
LLM costs scale linearly with usage by default. Cost engineering bends that curve. Most apps can cut spend 60-80% with no quality loss.
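The linear scaling is easy to see in a back-of-envelope cost model. A minimal sketch, with hypothetical placeholder prices (USD per million tokens, not any provider's real rates):

```python
# Back-of-envelope cost model: spend scales linearly with call volume.
# Prices are hypothetical placeholders (USD per 1M tokens), not real rates.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

def monthly_spend(calls_per_month: int, avg_in: int, avg_out: int) -> float:
    """By default, spend grows linearly with call volume."""
    return calls_per_month * call_cost(avg_in, avg_out)
```

Doubling traffic doubles spend unless one of the levers below bends the curve.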
Input-side levers
- Prompt compression: a system prompt of 2,000 tokens resent on every call adds up to 20K tokens of input across just ten calls. Trim ruthlessly.
- Prefix caching: provider-side caching of repeated prompt prefixes can discount cached input tokens by up to 90%, depending on the provider.
- Context pruning: don’t send the whole chat history; summarise old turns.
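The pruning lever can be sketched as a simple policy: keep the last few turns verbatim and collapse everything older into one summary message. The `summarise` call below is a hypothetical stand-in for whatever cheap model produces the summary:

```python
# Context pruning sketch: send recent turns verbatim, compress the rest.
# `summarise` is a hypothetical stand-in for a cheap summarisation call.
def summarise(turns: list[dict]) -> str:
    # Placeholder: a real implementation would call a cheap model here.
    return f"[summary of {len(turns)} earlier turns]"

def prune_history(history: list[dict], keep_recent: int = 4) -> list[dict]:
    """Collapse old turns into one summary message; keep the tail verbatim."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system", "content": summarise(old)}
    return [summary] + recent
```

Ten turns become one summary plus four verbatim messages, and input size stops growing with conversation length.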
Output-side levers
- Output limits: cap max_tokens. Most apps default too high.
- Streaming: doesn’t reduce cost but improves perceived speed.
- JSON mode: structured outputs are tighter than free text.
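Capping output is usually one request parameter. The sketch below sizes `max_tokens` per task type instead of using one generous global default; the task names and budgets are made up for illustration:

```python
# Per-task output caps instead of one generous global default.
# Task names and token budgets are illustrative, not prescriptive.
OUTPUT_CAPS = {
    "classification": 16,   # a label needs a handful of tokens
    "extraction": 256,      # structured JSON fields stay tight
    "summarisation": 512,
    "drafting": 2048,
}
DEFAULT_CAP = 1024

def build_request(task: str, messages: list[dict]) -> dict:
    """Attach the tightest defensible max_tokens for the task."""
    return {
        "messages": messages,
        "max_tokens": OUTPUT_CAPS.get(task, DEFAULT_CAP),
    }
```

A classifier that could legally emit 4,096 tokens but needs 16 is paying for headroom it never uses; the cap also bounds worst-case latency.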
Routing
The biggest single lever: route 70-90% of traffic to a cheap model and reserve the frontier model for the queries that genuinely need it. This alone can account for most of the 60-80% saving.
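A router can start as a crude heuristic and still capture much of the saving. A minimal sketch, where the model names, keyword hints, and length threshold are all placeholder assumptions:

```python
# Heuristic model router: cheap model by default, frontier on escalation.
# Model names, hints, and the length threshold are illustrative placeholders.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

ESCALATION_HINTS = ("prove", "legal", "diagnose", "step by step")

def route(query: str, max_cheap_len: int = 500) -> str:
    """Pick a model: escalate long or high-stakes queries to the frontier."""
    needs_frontier = (
        len(query) > max_cheap_len
        or any(hint in query.lower() for hint in ESCALATION_HINTS)
    )
    return FRONTIER_MODEL if needs_frontier else CHEAP_MODEL
```

Production routers typically replace the keyword heuristic with a small trained classifier, but the structure is the same: a cheap decision in front of an expensive call.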
The cost audit
- Pull cost by model and customer for the last 30 days.
- Identify the top 10 spend contributors.
- For each, ask: could a cheaper model + better prompt do it? Could caching shave 50%? Could batching halve compute?
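The first two audit steps are a group-by over whatever usage log you have. A minimal sketch, assuming each log row carries `model`, `customer`, and `cost_usd` fields (field names are assumptions):

```python
# Cost audit sketch: aggregate spend by (model, customer), then rank.
# Assumes usage rows with "model", "customer", "cost_usd" fields.
from collections import defaultdict

def top_spend(rows: list[dict], n: int = 10) -> list[tuple[tuple[str, str], float]]:
    """Top-n (model, customer) pairs by total spend over the window."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for row in rows:
        totals[(row["model"], row["customer"])] += row["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Each pair on that list then gets the three questions above: cheaper model, caching, batching.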
A quarterly audit keeps spend growing in line with usage rather than ahead of it.