FinOps & Cost Practical By Samson Tanimawo, PhD Published Nov 3, 2025 4 min read

Inference Cost

Inference cost; rightsizing GPUs.

Cost dimensions

Per-token cost (LLM APIs): scales with prompt and response length.

Per-request cost (smaller models, embeddings): often lower per-call but higher volume.

Per-second compute (self-hosted): GPU hours and concurrent throughput.

Right-size GPUs. Smaller GPUs (A10) for small models; larger (A100, H100) for large.

Caching: same prompt, same response. Hash-based cache hits free.

Batching: multiple requests per inference call. Higher GPU utilisation.

Smaller models for simple tasks. GPT-3.5 or Haiku for classification; GPT-4 or Opus only when needed.

Domain-specific fine-tunes outperform general models at lower cost. Worth the engineering investment for high-volume use cases.

Open-weight models for self-hosting. Llama 3, Mistral. Cost depends on infrastructure but can beat APIs at scale.

Per-feature attribution. Tag inference requests; allocate cost.

Per-user cost analysis. Identifies cost-driving users; informs pricing.

Quarterly review. Optimisation candidates surfaced; investments tracked.