Inference Cost

Inference cost; rightsizing GPUs.

Cost dimensions

Inference cost takes different shapes depending on how the model is served. Per-token billing dominates LLM APIs; per-request scales for smaller hosted models; per-second GPU compute drives self-hosted costs. Modern APIs add a fourth dimension via prompt caching: cached prefix tokens cost a fraction of fresh ones.

Optimisation patterns

Right-sizing, caching, batching, prompt caching, max_tokens caps. None of these are exotic, most teams skip them, all of them compound. The single biggest near-term win is usually adding cache_control to static system prompts; the next biggest is capping max_tokens per task type.

Model choice for cost

Model routing is the single biggest cost lever. Use the smallest model that produces correct output for the task; reserve frontier models (Opus, GPT-4-class) for tasks that genuinely need them. Cascade patterns route cheaper-first and escalate only on low confidence.

Monitoring inference cost

Without per-feature attribution, AI cost is a black box and the bill arrives as a surprise. Tag every request with a feature label, track per-user cost, review optimisation candidates quarterly, define per-feature cost SLOs so unit economics stay visible.