By Samson Tanimawo, PhD. Published Dec 27, 2026.

LLM Gateway Design

An LLM gateway sits between your app and the providers. Routing, caching, fallback, observability, and cost control all live here. Standing up a basic one is a weekend of work; not having one is a year of small fires.

What it does

An LLM gateway is a service in front of LLM API calls. It handles cross-cutting concerns: authentication, rate limiting, model routing, retry logic, cost tracking, observability, response caching, output filtering. Without a gateway, every application reimplements these; with one, they're centralised and consistent.
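One way to picture "centralised and consistent" is as a chain of wrappers around the provider call. The sketch below is a minimal illustration under that assumption, not a description of how any particular gateway is built. The names `require_api_key`, `record_calls`, `build_gateway`, and `fake_provider` are hypothetical, and only two of the concerns listed above are shown.

```python
from typing import Callable, Dict, List, Set

Request = Dict[str, str]             # hypothetical request shape: api_key, prompt
Handler = Callable[[Request], str]   # takes a request, returns completion text
Middleware = Callable[[Handler], Handler]


def require_api_key(valid_keys: Set[str]) -> Middleware:
    """Authentication: reject requests whose key is unknown."""
    def wrap(next_handler: Handler) -> Handler:
        def handler(request: Request) -> str:
            if request.get("api_key") not in valid_keys:
                raise PermissionError("unknown API key")
            return next_handler(request)
        return handler
    return wrap


def record_calls(log: List[Request]) -> Middleware:
    """Observability: remember every request that passed the earlier layers."""
    def wrap(next_handler: Handler) -> Handler:
        def handler(request: Request) -> str:
            log.append(request)
            return next_handler(request)
        return handler
    return wrap


def build_gateway(provider_call: Handler, middlewares: List[Middleware]) -> Handler:
    """Wrap the provider call so the first middleware in the list runs first."""
    handler = provider_call
    for middleware in reversed(middlewares):
        handler = middleware(handler)
    return handler


def fake_provider(request: Request) -> str:
    # Stand-in for a real provider SDK call.
    return f"echo: {request['prompt']}"


calls: List[Request] = []
gateway = build_gateway(fake_provider, [require_api_key({"team-a-key"}), record_calls(calls)])
print(gateway({"api_key": "team-a-key", "prompt": "hello"}))
```

Every application that calls `gateway` gets the same authentication and logging behaviour, which is the point of putting these concerns in one place.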

The motivation. Multiple applications calling LLMs need the same infrastructure: error handling, rate limiting, cost tracking. Reimplementing in each app produces inconsistency, bugs, security gaps. Centralising in a gateway produces consistency and pays back the gateway's build cost in months.

The "why now" timing. As organisations move beyond a single LLM use case, gateway needs become acute. Five teams each calling OpenAI separately means five different rate-limit configurations, five different retry strategies, and five different ways of tracking cost. The gateway is the architectural answer.

The honest scope. Gateways aren't magic; they're plumbing. Done well, they make LLM operations professional. Done poorly, they're another layer of latency and bugs. The investment is real; the pay-off is real if you're at the right scale.

OSS options

The 2026 landscape:

LiteLLM: Python library and standalone server. Wraps 100+ providers behind an OpenAI-compatible API. Among the most widely used open-source options; battle-tested at scale.
Portkey: open-source plus managed offering. Stronger observability and prompt management features.
Helicone: observability-focused; Apache-licensed; widely used for LLM call tracing.
OpenRouter: managed service that abstracts providers; not OSS but worth knowing.
Custom in-house: many large companies build their own gateway. Justifiable when needs are specific.

The LiteLLM case. Free, OSS, mature. Drop-in replacement for OpenAI client across many providers. Standalone server mode for centralized deployment. Most teams start here; some never need more.

The Portkey case. More features than LiteLLM in some areas (prompt management, A/B testing, advanced observability). Open core + paid features. For teams that need the additional features, the upgrade is worthwhile.

The Helicone case. Observability-first. Tracing, debugging, request inspection. Often used WITH a gateway (LiteLLM + Helicone) rather than as the gateway itself.

The custom-build decision. Build your own when you have highly specific requirements (regulated industry, internal tooling integration), significant scale (the point where working around OSS limits costs more than maintaining your own gateway), or security needs that OSS can't meet. Most teams don't need custom; the OSS options are mature.

Must-have features

The features that distinguish a useful gateway from a thin wrapper:

Provider failover: when OpenAI is down, route to Anthropic. Real outages happen; failover keeps your product up.
Cost tracking per request: tag with feature, customer, environment. Aggregate for analysis.
Token-based rate limiting: limit by tokens-per-minute, not requests-per-minute. Reflects cost and provider rate limits.
Response caching: exact-match and semantic caches reduce duplicate calls.
Streaming support: must work for streaming responses; many gateways struggle here.
Tracing: every request logged with prompt, response, latency, cost.
Output filtering: stop responses produced by a successful prompt injection from reaching users.

The failover detail. Detect provider errors (5xx, rate limits, timeouts). Retry against a different provider. Translate prompts/responses across provider formats. Without failover, your product's uptime is capped by the uptime of whichever single provider you depend on.
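The detect-and-retry loop can be sketched in a few lines. This is a minimal illustration, not a production implementation: `ProviderError`, `complete_with_failover`, and the two stand-in provider functions are hypothetical names, and each provider callable is assumed to be an adapter that already translates to and from that provider's request format.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Retryable failure: a 5xx response or a rate-limit (429) response."""


ProviderCall = Callable[[str], str]  # prompt in, completion text out


def complete_with_failover(
    prompt: str, providers: Sequence[Tuple[str, ProviderCall]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except (ProviderError, TimeoutError) as exc:
            # Retryable: remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def primary_down(prompt: str) -> str:
    # Stand-in for a provider that is currently failing.
    raise ProviderError("503 service unavailable")


def secondary_ok(prompt: str) -> str:
    # Stand-in for a healthy provider.
    return f"answer to: {prompt}"


used, text = complete_with_failover(
    "ping", [("primary", primary_down), ("secondary", secondary_ok)]
)
print(used, text)
```

Errors that are not retryable, such as a malformed request, are deliberately left to propagate: sending the same bad request to a second provider would only fail again.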

The cost-tracking detail. Per-request: which feature, which customer, which model, input tokens, output tokens, total cost. Stream to a data warehouse for aggregation. Real-time cost visibility is the foundation of cost engineering.
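A per-request record with those fields might look like the sketch below. The prices, model names, and the `CostRecord`, `make_record`, and `cost_by` names are all hypothetical; real prices vary by provider and change over time.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per 1M tokens: (input, output).
PRICES_PER_MILLION = {"small-model": (0.50, 1.50), "large-model": (5.00, 15.00)}


@dataclass(frozen=True)
class CostRecord:
    feature: str
    customer: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def make_record(feature: str, customer: str, model: str,
                input_tokens: int, output_tokens: int) -> CostRecord:
    """Price one request from its token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return CostRecord(feature, customer, model, input_tokens, output_tokens, round(cost, 6))


def cost_by(records: Iterable[CostRecord], field: str) -> Dict[str, float]:
    """Aggregate cost by any tag, e.g. 'feature', 'customer', or 'model'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost_usd
    return dict(totals)


records = [
    make_record("search", "acme", "large-model", 2000, 500),
    make_record("chat", "acme", "small-model", 1000, 1000),
]
# One JSON line per request is a convenient shape to stream to a warehouse.
for record in records:
    print(json.dumps(asdict(record)))
print(cost_by(records, "feature"))
```

Because every record carries its tags, the same stream answers "which feature is expensive" and "which customer is expensive" without extra instrumentation.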

The rate-limiting detail. Provider rate limits are usually expressed in tokens per minute, often alongside a requests-per-minute cap. Your gateway's rate limits should match: limit by tokens, not by requests, to avoid hitting provider limits and to prevent any single tenant from monopolising capacity.
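A token bucket measured in LLM tokens is one common way to implement this. The sketch below is illustrative: `TokenBudget` and `TenantLimiter` are hypothetical names, the bucket is assumed to hold at most one minute of budget, and the caller is assumed to supply a token estimate for each request.

```python
import time
from typing import Callable, Dict


class TokenBudget:
    """Token bucket in LLM tokens: refills continuously, capped at one minute of budget."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = float(tokens_per_minute)
        self.refill_per_second = tokens_per_minute / 60.0
        self.available = self.capacity
        self.clock = clock
        self.last = clock()

    def try_consume(self, tokens: int) -> bool:
        """Spend `tokens` if the budget allows it; otherwise leave the budget untouched."""
        now = self.clock()
        refill = (now - self.last) * self.refill_per_second
        self.available = min(self.capacity, self.available + refill)
        self.last = now
        if tokens > self.available:
            return False
        self.available -= tokens
        return True


class TenantLimiter:
    """One independent budget per tenant, so no tenant can drain another's capacity."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.tokens_per_minute = tokens_per_minute
        self.clock = clock
        self.budgets: Dict[str, TokenBudget] = {}

    def allow(self, tenant: str, estimated_tokens: int) -> bool:
        if tenant not in self.budgets:
            self.budgets[tenant] = TokenBudget(self.tokens_per_minute, self.clock)
        return self.budgets[tenant].try_consume(estimated_tokens)


limiter = TenantLimiter(tokens_per_minute=6000)
print(limiter.allow("tenant-a", 5000))  # fits in the fresh budget
print(limiter.allow("tenant-a", 2000))  # exceeds what is left this minute
print(limiter.allow("tenant-b", 2000))  # separate tenant, separate budget
```

The injectable `clock` exists only to make the limiter testable without sleeping.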

The caching detail. Exact-match: identical prompts return the cached response. Semantic: a prompt that is sufficiently similar to an earlier one returns the response cached for that earlier prompt. Both are useful; semantic is more aggressive but riskier (false-positive cache hits return wrong responses).
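Both cache types, and the false-positive risk, fit in a short sketch. Everything here is an assumption made for illustration: `ResponseCache` is a hypothetical name, and `toy_embed` is a bag-of-words stand-in for a real embedding model, chosen only so the example runs without dependencies.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, int]


def toy_embed(text: str) -> Vector:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ResponseCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed,
                 threshold: float = 0.9) -> None:
        self.embed = embed
        self.threshold = threshold
        self.exact: Dict[str, str] = {}
        self.entries: List[Tuple[str, Vector, str]] = []  # (model, prompt vector, response)

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def put(self, model: str, prompt: str, response: str) -> None:
        self.exact[self._key(model, prompt)] = response
        self.entries.append((model, self.embed(prompt), response))

    def get(self, model: str, prompt: str) -> Optional[str]:
        # Exact match first: cheap and never wrong.
        hit = self.exact.get(self._key(model, prompt))
        if hit is not None:
            return hit
        # Semantic match: best-scoring entry for the same model, if above the threshold.
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for entry_model, vector, response in self.entries:
            if entry_model != model:
                continue
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


strict = ResponseCache(threshold=0.9)
strict.put("m", "what is the capital of France", "Paris")
print(strict.get("m", "what is the capital of Germany"))  # miss at 0.9

loose = ResponseCache(threshold=0.8)
loose.put("m", "what is the capital of France", "Paris")
print(loose.get("m", "what is the capital of Germany"))   # false-positive hit at 0.8
```

With the toy embedding the two prompts share five of six words, so the looser threshold serves "Paris" for a question about Germany. A real embedding model changes the scores, not the nature of the trade-off.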

The streaming detail. Streaming responses don't fit traditional request-response patterns. The gateway must pipe tokens through as they arrive. Many basic gateways break streaming; verify that yours handles it before depending on it.
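In Python terms, "pipe tokens through as they arrive" means a generator that forwards each chunk immediately and only does its bookkeeping on the side. The sketch below assumes the upstream response is exposed as an iterable of text chunks; `relay_stream` and `on_complete` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List


def relay_stream(chunks: Iterable[str],
                 on_complete: Callable[[str], None]) -> Iterator[str]:
    """Forward each chunk as soon as it arrives; report the accumulated text at the end."""
    parts: List[str] = []
    try:
        for chunk in chunks:
            parts.append(chunk)
            yield chunk  # forwarded immediately, not buffered until the stream ends
    finally:
        # Runs on normal completion and also if the client stops reading mid-stream,
        # so partial responses are still available for logging and cost accounting.
        on_complete("".join(parts))


logged: List[str] = []
for piece in relay_stream(iter(["Hel", "lo"]), logged.append):
    print(piece)
print(logged)
```

A gateway that instead collects the whole response before returning it still produces correct text, but the caller loses the time-to-first-token benefit that streaming exists for.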

Build vs buy

For most teams in 2026, buy (or use OSS). LiteLLM, Portkey, Helicone, OpenRouter all work. Building in-house is justified when you have specific compliance requirements, multi-tenant routing complexity, or scale that exceeds OSS capacity. For 95% of use cases, OSS is the right choice.

The build cost. Initial: 4-12 engineer-weeks. Ongoing: 0.5-2 FTE for maintenance, feature additions, incident response. Over 3 years: $500K-$2M. The pay-back is in operational value; for teams below a certain scale, the OSS alternative is faster and cheaper.
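As a sanity check on those ranges, the arithmetic below converts them into FTE-years and backs out the engineer cost they imply. The 52-week year is an assumption made here, and the implied cost per FTE-year is inferred from the figures above rather than stated anywhere in this article.

```python
WEEKS_PER_YEAR = 52  # assumption: calendar weeks, not working weeks


def fte_years(initial_engineer_weeks: float, ongoing_fte: float, years: int = 3) -> float:
    """Total effort over the period: one-off build plus ongoing maintenance."""
    return initial_engineer_weeks / WEEKS_PER_YEAR + ongoing_fte * years


low_effort = fte_years(initial_engineer_weeks=4, ongoing_fte=0.5)    # about 1.58 FTE-years
high_effort = fte_years(initial_engineer_weeks=12, ongoing_fte=2.0)  # about 6.23 FTE-years

# Cost per FTE-year implied by the $500K and $2M three-year totals.
implied_low = 500_000 / low_effort
implied_high = 2_000_000 / high_effort

print(f"effort: {low_effort:.2f} to {high_effort:.2f} FTE-years")
print(f"implied cost per FTE-year: ${implied_low:,.0f} to ${implied_high:,.0f}")
```

Both ends land near $320K per FTE-year, so the dollar range is consistent with the effort range. It also shows that ongoing maintenance, not the initial build, accounts for almost all of the cost.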

The compliance case for build. Some regulated industries require that all data flow through internal systems. A custom gateway lets you satisfy specific compliance requirements (audit logging, data residency, encryption-at-rest with specific algorithms). OSS can't always meet these.

The multi-tenant case for build. SaaS companies serving many customers each with their own LLM credentials. Per-customer rate limits, per-customer cost tracking, per-customer model preferences. Some OSS options handle this; some don't. Evaluate; build if necessary.
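The per-customer settings described above amount to a policy lookup that runs before every upstream call. A minimal sketch, assuming tenants are identified by a string ID and credentials live in a secret store keyed by name; `TenantPolicy`, `resolve_policy`, `build_upstream_request`, and all the values shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class TenantPolicy:
    secret_name: str        # where this tenant's provider credential is stored
    tokens_per_minute: int  # per-customer rate limit
    preferred_model: str    # per-customer model preference


# Tenants without an explicit policy fall back to shared defaults.
DEFAULT_POLICY = TenantPolicy("SHARED_PROVIDER_KEY", 20_000, "small-model")

POLICIES: Dict[str, TenantPolicy] = {
    "acme": TenantPolicy("ACME_PROVIDER_KEY", 100_000, "large-model"),
}


def resolve_policy(tenant_id: str) -> TenantPolicy:
    return POLICIES.get(tenant_id, DEFAULT_POLICY)


def build_upstream_request(tenant_id: str, prompt: str,
                           secrets: Mapping[str, str]) -> Dict[str, object]:
    """Apply the tenant's credential and model, and tag the call for cost tracking."""
    policy = resolve_policy(tenant_id)
    return {
        "model": policy.preferred_model,
        "api_key": secrets[policy.secret_name],
        "prompt": prompt,
        "metadata": {"tenant": tenant_id},
    }


secrets = {"SHARED_PROVIDER_KEY": "shared-secret", "ACME_PROVIDER_KEY": "acme-secret"}
print(build_upstream_request("acme", "hello", secrets)["model"])
print(build_upstream_request("unknown-tenant", "hello", secrets)["model"])
```

When evaluating an OSS gateway for multi-tenant use, the question is whether it can express this lookup natively or whether you end up maintaining it in a layer of your own.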

The scale case for build. Above ~$1M/month LLM spend, custom optimisations start to pay back. Cache hit rate improvements, batched inference for routing, custom fallback logic. The optimisation work amortises over the spend; smaller scale doesn't justify it.

The hybrid pattern. Use OSS for the bulk; add a thin custom layer for the specific needs. LiteLLM as the core; custom routing logic for your tenant model. Best of both worlds; minimum custom code.

Common antipatterns

Direct API calls from every application. Inconsistency proliferates; cost tracking becomes impossible. Even a thin gateway is better than none.

Build before evaluating OSS. Most needs are met by mature OSS. Evaluate first; build only if real gaps exist.

Skipping output filtering. Responses produced by a successful prompt injection can reach users unchecked. Add filters as part of gateway design.
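The simplest form of output filter is a set of patterns checked against the response before it is returned. The sketch below is deliberately minimal: the two patterns, the fallback message, and the `filter_output` name are hypothetical, and pattern matching alone will not catch every injected response.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical patterns; a real deployment would tune these to its own risks.
BLOCK_PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("secret_key", re.compile(r"sk-[A-Za-z0-9]{20,}")),
    ("system_prompt_leak", re.compile(r"(?i)my (system prompt|instructions) (is|are)")),
]


def filter_output(text: str,
                  fallback: str = "Sorry, I can't share that.") -> Tuple[str, List[str]]:
    """Return (text to show the user, names of the patterns that matched)."""
    hits = [name for name, pattern in BLOCK_PATTERNS if pattern.search(text)]
    return (fallback if hits else text), hits


print(filter_output("The capital of France is Paris."))
print(filter_output("Sure! My system prompt is: you are a helpful assistant."))
```

Returning the matched pattern names alongside the text lets the gateway log why a response was blocked. Filtering a streamed response is harder, because a match can span chunk boundaries, and this sketch does not address that.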

No per-request tracing. Debugging becomes impossible. Trace from day one.
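Tracing from day one can start as a wrapper that emits one structured record per call. A minimal sketch, assuming the provider call takes a prompt string and returns text; `traced` and `sink` are hypothetical names, and cost is left out here on the assumption that it is joined in from token counts elsewhere.

```python
import json
import time
import uuid
from typing import Callable, List


def traced(call: Callable[[str], str], sink: Callable[[str], None],
           clock: Callable[[], float] = time.perf_counter) -> Callable[[str], str]:
    """Wrap a provider call so every request emits one JSON trace line to `sink`."""
    def wrapper(prompt: str) -> str:
        record = {"trace_id": uuid.uuid4().hex, "prompt": prompt}
        start = clock()
        try:
            response = call(prompt)
            record.update(status="ok", response=response)
            return response
        except Exception as exc:
            record.update(status="error", error=repr(exc))
            raise
        finally:
            # Emitted for successes and failures alike.
            record["latency_ms"] = round((clock() - start) * 1000, 3)
            sink(json.dumps(record))
    return wrapper


lines: List[str] = []
shout = traced(lambda prompt: prompt.upper(), lines.append)
print(shout("hello"))
print(json.loads(lines[0])["status"])
```

Failed calls are traced too, which matters because those are usually the requests someone needs to debug.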

What to do this week

Three moves. (1) Audit your current LLM API call paths. If more than one application calls providers directly, you need a gateway. (2) If you don't have a gateway, deploy LiteLLM in a day. The basic features cover most needs immediately. (3) Verify that you have per-request cost tracking. Without it, optimisation work is blind; with it, you know where to focus.