Inference Cost
Inference cost; rightsizing GPUs.
Cost dimensions
Inference cost takes different shapes depending on how the model is served. Per-token billing dominates LLM APIs; per-request scales for smaller hosted models; per-second GPU compute drives self-hosted costs. Modern APIs add a fourth dimension via prompt caching: cached prefix tokens cost a fraction of fresh ones.
- Per-token cost (LLM APIs). Scales with prompt plus response length. Track input and output tokens per request.
- Per-request cost. Smaller hosted models and embeddings. Lower per call, higher volume.
- Per-second GPU compute (self-hosted). GPU hours and concurrent throughput. Cluster utilisation drives the unit cost.
- Prompt-cache cost. Cached-context write and read pricing. Long static prefixes cost a fraction of fresh tokens.
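The four dimensions above can be folded into one back-of-envelope estimate. The sketch below is illustrative only: the `Pricing` fields, the example rates, and the helper names are assumptions made for this example, not any vendor's actual price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens (not real vendor rates)."""
    input_per_mtok: float
    output_per_mtok: float
    cache_write_per_mtok: float
    cache_read_per_mtok: float


def request_cost(p: Pricing, fresh_input: int, output: int,
                 cache_write: int = 0, cache_read: int = 0) -> float:
    """Return the USD cost of one API request from its token counts."""
    total = (fresh_input * p.input_per_mtok
             + output * p.output_per_mtok
             + cache_write * p.cache_write_per_mtok
             + cache_read * p.cache_read_per_mtok)
    return total / 1_000_000


def gpu_cost_per_mtok(hourly_usd: float, tokens_per_second: float) -> float:
    """Return the self-hosted USD cost per million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return hourly_usd / (tokens_per_second * 3600) * 1_000_000


# Assumed example rates: cached reads priced at a tenth of fresh input.
EXAMPLE = Pricing(input_per_mtok=2.0, output_per_mtok=10.0,
                  cache_write_per_mtok=2.5, cache_read_per_mtok=0.2)

# Same 10,000-token prompt, with and without a 9,000-token cached prefix.
uncached = request_cost(EXAMPLE, fresh_input=10_000, output=500)
cached = request_cost(EXAMPLE, fresh_input=1_000, output=500, cache_read=9_000)
```

With these assumed rates the cached request costs roughly a third of the uncached one. In `gpu_cost_per_mtok`, halving sustained throughput doubles the unit cost, which is why cluster utilisation matters for self-hosted serving.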
Optimisation patterns
The main patterns are right-sizing, response caching, batching, prompt caching, and max_tokens caps. None of these is exotic, yet most teams skip them, and their savings compound. The single biggest near-term win is usually adding cache_control to static system prompts; the next biggest is capping max_tokens per task type.
- Right-size GPUs. A10 for small models; A100 or H100 for large. Match the GPU to the model rather than over-provisioning.
- Hash-based response caching. An identical prompt returns the stored response at no inference cost. Standard middleware win.
- Batching plus prompt caching. Sending multiple requests per inference call raises GPU utilisation; cache_control on static prefixes shrinks per-call cost dramatically.
- Per-call max_tokens. Explicit cap per task type. Real cost protection against runaway responses.
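Two of the patterns above, hash-based response caching and per-task max_tokens caps, fit in a small piece of middleware. This is a minimal sketch under stated assumptions: `ResponseCache`, `MAX_TOKENS_BY_TASK`, the cap values, and the `call_model` callable are all hypothetical names for this example, and the cache lives in process memory rather than a shared store.

```python
import hashlib
import json
from typing import Callable, Dict

# Hypothetical per-task caps; tune them to the output lengths you observe.
MAX_TOKENS_BY_TASK: Dict[str, int] = {
    "classification": 16,
    "extraction": 256,
    "summary": 512,
}
DEFAULT_MAX_TOKENS = 256


class ResponseCache:
    """Wrap a model call so identical requests are served from memory."""

    def __init__(self, call_model: Callable[[str, str, int], str]) -> None:
        # call_model(model, prompt, max_tokens) -> response text (assumed shape).
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str, max_tokens: int) -> str:
        # Hash every parameter that can change the output, in a stable order.
        payload = json.dumps(
            {"model": model, "prompt": prompt, "max_tokens": max_tokens},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, task_type: str, prompt: str) -> str:
        max_tokens = MAX_TOKENS_BY_TASK.get(task_type, DEFAULT_MAX_TOKENS)
        key = self._key(model, prompt, max_tokens)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt, max_tokens)
        self._store[key] = response
        return response
```

Exact-match caching like this is only appropriate when repeated prompts should produce the same answer, for example deterministic settings or idempotent tasks such as classification. Any other parameter that affects the output would also need to go into the key.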
Model choice for cost
Model routing is the single biggest cost lever. Use the smallest model that produces correct output for the task; reserve frontier models (Opus, GPT-4-class) for tasks that genuinely need them. Cascade patterns route cheaper-first and escalate only on low confidence.
- Smallest competent model per task. Haiku or equivalent for classification, extraction, formatting; Sonnet for medium tasks; Opus only when needed.
- Domain-specific fine-tunes. Outperform general models at lower cost for high-volume use cases.
- Open-weight self-hosting. Llama or Mistral for workloads where infrastructure cost beats API pricing at scale.
- Per-call cascade. Cheap model first; escalate to a larger model on low confidence. Cuts cost with little quality regression, provided the confidence signal is reliable.
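The cascade pattern can be sketched as a loop over tiers ordered from cheapest to most expensive. Everything named here (`Tier`, `cascade`, the 0.8 threshold, and the assumption that each model call returns a confidence score in [0, 1]) is hypothetical; how confidence is obtained, whether from log-probabilities, a verifier, or a self-check, is left open.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    """One model in the cascade; `call` returns (answer, confidence in [0, 1])."""
    name: str
    call: Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: List[Tier],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try tiers cheapest-first and return (answer, name of the tier that answered)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    # Every tier except the last must clear the confidence threshold.
    for tier in tiers[:-1]:
        answer, confidence = tier.call(prompt)
        if confidence >= threshold:
            return answer, tier.name
    # The final tier is the fallback: its answer is accepted unconditionally.
    answer, _ = tiers[-1].call(prompt)
    return answer, tiers[-1].name
```

Returning the tier name alongside the answer makes it straightforward to log the escalation rate, which determines whether the cascade actually saves money: an escalated request pays for both the cheap and the expensive call.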
Monitoring inference cost
Without per-feature attribution, AI cost is a black box and the bill arrives as a surprise. Tag every request with a feature label, track per-user cost, review optimisation candidates quarterly, and define per-feature cost SLOs so unit economics stay visible.
- Per-feature attribution. Feature tag on every request. Cost allocation becomes tractable.
- Per-user cost analysis. Identifies cost-driving users; informs pricing decisions and abuse detection.
- Quarterly optimisation review. Review optimisation candidates each quarter. Catches drift before it becomes the next surprise invoice.
- Per-feature cost SLO. Cost-per-action SLO per feature. Unit economics stay measurable.
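Once every request carries a feature tag and a user ID, attribution and SLO checks reduce to simple aggregation. The sketch below assumes usage records have already been collected with a computed cost; `UsageRecord`, `cost_by`, and `slo_breaches` are hypothetical names, and the SLO is taken to be a ceiling on mean cost per action.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    """One billed request, tagged at call time (assumed schema)."""
    feature: str
    user_id: str
    cost_usd: float


def cost_by(records: Iterable[UsageRecord], field: str) -> Dict[str, float]:
    """Sum cost grouped by a record field, e.g. "feature" or "user_id"."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(totals)


def slo_breaches(records: Iterable[UsageRecord],
                 slo_usd_per_action: Dict[str, float]) -> Dict[str, float]:
    """Return {feature: mean cost per action} for features above their SLO."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.feature] += record.cost_usd
        counts[record.feature] += 1
    breaches: Dict[str, float] = {}
    for feature, limit in slo_usd_per_action.items():
        if counts.get(feature, 0) == 0:
            continue  # no traffic for this feature, nothing to evaluate
        mean_cost = totals[feature] / counts[feature]
        if mean_cost > limit:
            breaches[feature] = mean_cost
    return breaches
```

Calling `cost_by(records, "user_id")` gives the per-user view used for pricing and abuse detection, while `slo_breaches` yields a short list of features to examine in the quarterly review.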