LLM and GenAI Cost Engineering
GenAI costs are unique; the levers are well-known but rarely all applied. Done right, 70-90% reduction is realistic.
Why GenAI cost is different
GenAI cost has a unique shape: token volume times model tier. A bad architecture can multiply the bill 10x at the same user volume; the optimisation opportunities are correspondingly large.
- Token-volume math. Cost scales with input plus output tokens; payload size matters more than request count.
- Model-tier multiplier. Tier choice (Haiku, Sonnet, Opus) shifts per-token cost by 5-25x; routing dominates.
- Architecture impact. Bad chunking, oversized prompts, expensive defaults can 10x the bill at constant user volume.
- Optimisation upside. 70-90% reduction realistic when all four levers ship; the headroom is unlike traditional compute.
Four levers
- 1. Model routing (cheap model for easy tasks).
- 2. Prompt caching (Anthropic, OpenAI both support).
- 3. Batch API for non-realtime.
- 4. Fine-tuning instead of long prompts.
Per-lever savings
Each lever produces measurable savings. Combined, the four routinely produce 70-90% reduction; ship one at a time, measure each, stack them.
- Routing. 30-50% savings; cheaper model for easy tasks (classification, extraction, formatting).
- Caching. 50-90% savings on cacheable prompts; system prompts and few-shot examples cache cleanly.
- Batch API. 50% off realtime price for non-urgent workloads; offline classification, bulk processing.
- Fine-tuning. Model-dependent; can replace long prompts entirely; 5x cheaper at scale.
Cost-aware GenAI culture
The technical levers only land with cultural support. Engineers need per-feature visibility, PR-time cost estimates, and cost-aware code review for the discipline to stick.
- Per-feature cost. Engineers see what each feature costs in tokens and dollars; visibility drives prioritisation.
- Per-PR estimate. Cost impact of the PR estimated automatically; reviewers can flag expensive changes.
- Cost-aware review. 'Did you consider Haiku for this?' becomes a normal review comment.
- Mirrors traditional FinOps. The discipline is the same; the levers and granularity differ.
Antipatterns
- Default to expensive model for everything. 10x bill premium.
- No prompt caching. Pay full price for every cached prompt.
- No per-feature cost view. Cannot optimize.
What to do this week
Three moves. (1) Apply this lever to your highest-spend workload. (2) Measure the dollar impact for one month. (3) Roll the practice out to the next two services if the savings hold.