Inference Cost
How inference spend accrues across pricing models, and the main levers for reducing it: right-sizing GPUs, caching, batching, and model choice.
Cost dimensions
Per-token cost (LLM APIs): scales with prompt and response length.
Per-request cost (smaller models, embeddings): often lower per-call but higher volume.
Per-second compute (self-hosted): you pay for GPU hours whether the cards are busy or idle; cost per request falls as concurrent throughput rises.
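To make the dimensions concrete, the sketch below compares per-token API pricing against flat GPU-hour pricing for a hypothetical workload. All rates and volumes are illustrative assumptions, not quoted vendor prices.

```python
# Sketch: comparing per-token API pricing with flat GPU-hour pricing.
# Every rate below is an illustrative assumption, not a current vendor price.

def api_cost_per_month(requests: int, prompt_tokens: int, response_tokens: int,
                       price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Per-token pricing: cost scales with prompt and response length."""
    per_request = ((prompt_tokens / 1000) * price_in_per_1k
                   + (response_tokens / 1000) * price_out_per_1k)
    return requests * per_request

def self_hosted_cost_per_month(gpu_hourly_rate: float, gpus: int,
                               hours: float = 730) -> float:
    """Per-second compute: GPU hours are billed whether or not they are busy."""
    return gpu_hourly_rate * gpus * hours

# Example: 2M requests/month, 800 prompt + 300 response tokens each,
# at assumed rates of $0.0005/1k input and $0.0015/1k output tokens,
# versus two GPUs at an assumed $1.20/hour on-demand rate.
api = api_cost_per_month(2_000_000, 800, 300, 0.0005, 0.0015)
hosted = self_hosted_cost_per_month(1.20, gpus=2)
print(f"API: ${api:,.0f}/month  self-hosted: ${hosted:,.0f}/month")
```

The crossover point depends on utilisation: the self-hosted number is fixed, so it only wins when volume is high enough to keep the GPUs busy.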
Optimisation patterns
Right-size GPUs. Smaller GPUs (A10) are cheaper and sufficient for small models; reserve A100 or H100 class cards for models that actually need the memory and bandwidth.
Caching: identical prompts yield identical responses, so serve repeats from a hash-keyed cache; a cache hit costs nothing at inference time (see the cache sketch after this list).
Batching: group multiple requests into one inference call to raise GPU utilisation (sketched below).
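A minimal hash-keyed cache, assuming a `call_model` callable that stands in for whatever inference client you actually use:

```python
import hashlib
import json

_cache: dict[str, str] = {}  # in production, use Redis or similar

def cache_key(prompt: str, params: dict) -> str:
    """Stable key over prompt and sampling parameters."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(prompt: str, params: dict, call_model) -> str:
    key = cache_key(prompt, params)
    if key in _cache:
        return _cache[key]          # cache hit: zero inference cost
    response = call_model(prompt, params)
    _cache[key] = response
    return response
```

Note the key includes the sampling parameters: caching only makes sense when decoding is deterministic (temperature 0) or when serving a repeated response is acceptable.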
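For batching, one simple pattern is micro-batching: hold each request briefly, collect up to a batch-size limit, and run one forward pass. The sketch below assumes a `model.generate` entry point that accepts a list of prompts; the name, batch size, and wait window are illustrative.

```python
import asyncio

MAX_BATCH = 16      # assumed batch-size cap
MAX_WAIT_MS = 20    # assumed maximum wait before dispatching a partial batch

queue: asyncio.Queue = asyncio.Queue()

async def submit(prompt: str) -> str:
    """Enqueue a request and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def batch_worker(model):
    """Drain the queue into batches and run one inference call per batch."""
    while True:
        batch = [await queue.get()]
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        prompts = [p for p, _ in batch]
        outputs = model.generate(prompts)   # one forward pass for the whole batch
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)
```

Serving frameworks such as vLLM implement continuous batching along these lines, so self-hosters rarely need to write this themselves.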
Model choice for cost
Use smaller models for simple tasks: GPT-3.5 or Haiku handle classification and extraction well; reserve GPT-4 or Opus for requests that genuinely need them (see the routing sketch after this list).
Domain-specific fine-tunes can match or beat general models on narrow tasks at lower cost; the engineering investment pays off for high-volume use cases.
Open-weight models (Llama 3, Mistral) for self-hosting: cost depends on your infrastructure, but at sustained volume they can undercut API pricing.
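A routing layer keeps the cheap-by-default policy in one place. The model names and the task heuristic below are illustrative assumptions; a real router might instead escalate based on the cheap model's confidence.

```python
# Sketch: route each request to the cheapest model likely to handle it.

SIMPLE_TASKS = {"classification", "extraction", "sentiment"}  # assumed taxonomy

def pick_model(task_type: str) -> str:
    """Cheap model for simple tasks; escalate only when needed."""
    if task_type in SIMPLE_TASKS:
        return "gpt-3.5-turbo"   # or a Haiku-class model
    return "gpt-4"               # or an Opus-class model
```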
Monitoring inference cost
Per-feature attribution: tag every inference request with the feature that issued it, then roll spend up per feature (see the sketch after this list).
Per-user cost analysis: identifies which users drive spend; informs pricing and rate limits.
Quarterly review: surface optimisation candidates and track whether past investments paid off.
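A minimal attribution sketch, assuming each request is logged with a feature tag and token counts; the token prices and field names are illustrative assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

PRICE_IN_PER_1K = 0.0005    # illustrative rate, not a quoted price
PRICE_OUT_PER_1K = 0.0015

@dataclass
class InferenceRecord:
    feature: str            # e.g. "search_summary" (hypothetical tag)
    user_id: str
    prompt_tokens: int
    response_tokens: int

def cost(r: InferenceRecord) -> float:
    """Per-request cost from token counts at the assumed rates."""
    return ((r.prompt_tokens / 1000) * PRICE_IN_PER_1K
            + (r.response_tokens / 1000) * PRICE_OUT_PER_1K)

def attribute(records: list[InferenceRecord]) -> dict[str, float]:
    """Roll spend up per feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.feature] += cost(r)
    return dict(totals)
```

The same records keyed by `user_id` instead of `feature` give the per-user view.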