AI & ML Advanced By Samson Tanimawo, PhD Published Dec 27, 2026 5 min read

Cost Engineering for LLM Apps

LLM costs scale linearly with usage by default. Cost engineering bends that curve. Most apps can cut spend 60-80% with no quality loss.

Input-side levers

Most LLM cost is input tokens (input is typically 2-10x output volume). The biggest cost lever is reducing input tokens.

Prompt caching, most providers cache static prompt prefixes. 5-10x discount on cached tokens. Add cache_control on system prompts and few-shot examples. Free; immediate impact.
Shorter system prompts, every word costs every request. Audit; remove redundancy.
Fewer few-shot examples, examples cost per request. Move from 10 examples to 3 if quality holds.
Smaller retrieval contexts, RAG with 1000-token context vs 8000 cuts cost 8x.

The prompt-caching priority. If your provider supports it (Anthropic, OpenAI, Google all do as of 2025), use it. Cache the system prompt; cache few-shot examples. The 5-10x discount on cached tokens is essentially free money for any application that has a stable prompt prefix.

The system-prompt audit. Look at your system prompt. Each word costs every request. Vague language ("be helpful") is cheap. Detailed instructions ("you are a helpful assistant who responds with detailed step-by-step explanations") cost more per request and may not improve quality. Tighten ruthlessly.

The few-shot reduction. Test reducing examples. Often 3 well-chosen examples outperform 10 mediocre ones. Each removed example saves cost on every call. The reduction work pays back fast.

The retrieval-context tightening. RAG that retrieves 8K tokens "to be safe" costs 8x more than retrieving 1K tokens. Test: does the larger context actually improve quality? Often not. Tighten to the smallest context that maintains quality.

Output-side

Outputs are smaller in volume but more expensive per token. Levers:

Max-tokens caps, never leave at default. Cap to "most expensive task this would generate".
Structured output, JSON schema constraints reduce verbose narration.
Stop sequences, model stops as soon as the answer is complete.
Smaller models for simple tasks, Haiku for classification, Sonnet for code, Opus only when needed.

The max-tokens default. Many SDK defaults are 4K or 8K, much larger than most use cases need. A classification task needs 10 tokens; a chat response needs 200-500. Setting max_tokens aggressively is free and catches runaway generations.

The structured-output advantage. JSON schemas eliminate model verbosity. "Output JSON with fields foo and bar" produces concise output; "Tell me about foo and bar" produces verbose narration with the same information. Structured output cuts output tokens 30-70% for tasks that suit it.

The stop-sequence pattern. Define a stop sequence (like \n\n or end-of-section marker). Model halts generation when it reaches the sequence. Useful for tasks where the answer length varies but the boundary is clear.

The model-routing reality. Anthropic's Haiku is 12x cheaper than Sonnet, 60x cheaper than Opus. OpenAI's mini and nano variants are similar. For simple tasks, the small model produces equivalent quality at a fraction of cost. Build a router; benefits compound at scale.

Routing

One model isn't right for everything. Cheap-and-fast for simple queries (classification, extraction, simple Q&A); medium for code generation and reasoning; expensive for the hardest queries. A simple classifier (or small LLM) routes incoming queries; saves 5-20x on cost without sacrificing quality.

The router's role. Receives query; classifies as easy/medium/hard; routes to the appropriate model. The router itself can be a Haiku-class call ($0.001/query) deciding which downstream model to use ($0.01-$1/query). The router's cost is negligible vs the savings it produces.

The classification. Simple classifier: keywords, length, structure suggest difficulty. ML classifier: trained on labeled examples of "this query was easy/hard for the small model". The trained classifier is more accurate; takes engineering investment to build.

The fallback pattern. Try the cheap model; if confidence is low, escalate to the more expensive one. The cheap model handles the easy cases; expensive model only handles the hard cases. Implementation: confidence-based routing rather than upfront classification.

The savings. Workloads with mixed difficulty often save 5-10x with routing. The unmix is the source of waste; the router unmixes by sending each query to the right model. Production at scale always has routing; small projects often skip it and overspend.

The cost audit

Quarterly: pull last quarter's LLM spend; break it down by feature, query type, model. Find the top 3 cost drivers. Optimise those. Re-audit. Most teams that have been running LLM apps for a year discover 30-60% reducible cost in their first audit.

The audit methodology. Tag every API call with feature, environment, customer tier. Aggregate spend by tag. Sort by cost. The top items are where optimisation effort goes; lower items aren't worth the effort.

The first-audit surprises. Common findings: one feature using Opus for tasks Haiku could handle. One feature with 4x more retrieval context than needed. One internal debug tool calling production model in a loop. Each is a single fix; combined they often produce 50%+ savings.

The continuous-monitoring direction. After the first audit, build dashboards that surface cost spikes early. New features that suddenly cost 10x. Old features that grew quietly. Automated alerts on cost growth catch problems before they're surprises.

The cost-aware development culture. Engineers who see their feature's cost make different decisions. Per-feature cost dashboards visible to engineers; cost reviews part of feature launches. Cultural change takes time but compounds.

Common antipatterns

Defaulting to the most capable model. Cost without proportional benefit on easy tasks. Match model to task difficulty.

Skipping prompt caching. Free 5-10x discount left on the table. Always cache stable prompt prefixes.

No max-tokens caps. Runaway generation occasionally bills $100 for a query that should cost $0.10.

Cost monitoring at the project level only. Need per-feature, per-customer attribution to find optimisation targets.

What to do this week

Three moves. (1) Run a one-day cost audit. Pull spend by feature; sort. The top 3 features are where to optimise. (2) Add prompt caching to your highest-volume API. The benefit is immediate; the work is hours. (3) Cap max_tokens on every call. Pick a reasonable cap based on the task; document why. The cap is your seatbelt against runaway costs.