LLM and GenAI Cost Engineering
GenAI cost behaves unlike traditional infrastructure spend: the levers are well known but rarely all applied. Applied together, a 70-90% reduction is realistic.
Why GenAI cost is different
Cost scales with token volume × model tier: a wasteful architecture can 10x the bill at the same user volume, which also means the optimization opportunities are unusually large.
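A minimal sketch of the scaling claim. The per-million-token prices and workload numbers below are illustrative assumptions, not current vendor rates:

```python
# Illustrative only: tier names and prices are hypothetical, not real vendor rates.
PRICE_PER_MTOK = {"premium": 15.00, "mid": 3.00, "small": 0.25}  # $ per 1M input tokens

def monthly_cost(requests_per_day: int, tokens_per_request: int, tier: str) -> float:
    """Monthly input-token cost for one workload at a given model tier."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * PRICE_PER_MTOK[tier]

# Same user traffic, two architecture choices:
naive = monthly_cost(100_000, 4_000, "premium")  # long prompts, top-tier model
lean  = monthly_cost(100_000, 1_200, "mid")      # trimmed prompts, mid-tier model
print(f"naive=${naive:,.0f}/mo  lean=${lean:,.0f}/mo  ratio={naive / lean:.1f}x")
```

With these assumed prices the gap is already more than 10x, before any caching or batching is applied.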
Four levers
- 1. Model routing: send easy tasks to a cheaper model.
- 2. Prompt caching: both Anthropic and OpenAI support it.
- 3. Batch API for non-realtime workloads.
- 4. Fine-tuning instead of long prompts.
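Lever 1 can be as simple as a rule-based router. A minimal sketch, where the model names and the task-complexity heuristic are placeholders (production routers often use a small classifier instead):

```python
# Hypothetical model names; substitute your provider's actual model IDs.
CHEAP_MODEL, STRONG_MODEL = "small-model", "frontier-model"

def route(task: str) -> str:
    """Send recognizably easy tasks to the cheap model, everything else to the strong one."""
    easy_markers = ("classify", "extract", "summarize")  # assumed heuristic
    if any(task.lower().startswith(m) for m in easy_markers):
        return CHEAP_MODEL
    return STRONG_MODEL

print(route("Classify this support ticket"))                   # -> small-model
print(route("Draft a migration plan for our billing system"))  # -> frontier-model
```

The design point is that the router is the single choke point where tier decisions live, so tightening the heuristic later changes spend without touching callers.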
Per-lever savings
Routing: 30-50% savings. Caching: 50-90% on cacheable prompts. Batch: 50% off the realtime price. Fine-tuning: savings are model-dependent.
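The caching number depends on two things: the discount on cache hits and the hit rate. A sketch of the blended price, with both parameters as assumptions (check your provider's actual cached-token rates):

```python
def effective_price(base_price: float, cache_discount: float, hit_rate: float) -> float:
    """Blended per-token price when a fraction of prompt tokens hit the cache.

    cache_discount: fraction off the base price on a cache hit (assumed value).
    hit_rate: fraction of prompt tokens served from cache (assumed value).
    """
    miss_cost = base_price * (1 - hit_rate)
    hit_cost = base_price * (1 - cache_discount) * hit_rate
    return miss_cost + hit_cost

# Assuming a 90% discount on hits and an 80% hit rate:
savings = 1 - effective_price(1.0, 0.9, 0.8)
print(f"{savings:.0%}")  # -> 72%
```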
Combined, because the levers compound on each other, a total reduction of 70-90% is common.
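The combined figure follows from the levers compounding multiplicatively: each one reduces the spend left over by the previous one. A sketch with illustrative fractions drawn from the ranges above:

```python
def combined_reduction(*fractions: float) -> float:
    """Total reduction when each lever cuts the spend remaining after the previous one."""
    remaining = 1.0
    for f in fractions:
        remaining *= (1 - f)
    return 1 - remaining

# Assumed: routing saves 40%, caching 60% of what's left, batching 50% of what's left.
print(f"{combined_reduction(0.40, 0.60, 0.50):.0%}")  # -> 88%
```

Note the levers do not simply add; three 50% levers yield 87.5%, not 150%.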
Cost-aware GenAI culture
Engineers see per-feature LLM cost, every PR carries a cost-impact estimate, and code reviews treat cost as a first-class concern.
The discipline mirrors traditional cost engineering.
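The per-feature view starts with tagging every LLM call with the feature that triggered it and aggregating. A minimal sketch; the field names are assumptions, not any vendor's billing schema:

```python
from collections import defaultdict

# Hypothetical call log: each LLM request tagged with its originating feature.
calls = [
    {"feature": "search-summaries", "tokens": 3_200, "usd": 0.048},
    {"feature": "autocomplete",     "tokens": 450,   "usd": 0.001},
    {"feature": "search-summaries", "tokens": 2_900, "usd": 0.044},
]

def cost_by_feature(calls: list[dict]) -> dict[str, float]:
    """Aggregate dollar spend per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for c in calls:
        totals[c["feature"]] += c["usd"]
    return dict(totals)

print(cost_by_feature(calls))  # search-summaries dominates the spend
```

Once this table exists, the routing and caching levers above have an obvious target: the top row.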
Antipatterns
- Defaulting to the most expensive model for everything; this can add a roughly 10x premium to the bill.
- No prompt caching, so you pay full price for every cacheable prompt.
- No per-feature cost view; without visibility, you cannot optimize.
What to do this week
Three moves. (1) Apply the highest-impact lever above to your highest-spend workload. (2) Measure the dollar impact for one month. (3) If the savings hold, roll the practice out to the next two services.