LLM Routing: Haiku for Cheap, Opus for Hard
Sending every request to your most capable model is the most expensive mistake in LLM engineering. A simple router cuts cost 60-80% without measurable quality loss.
Why route at all
Models scale 10-30x in cost from cheap (Haiku, GPT-4o-mini, Gemini Flash) to flagship (Opus, GPT-4o, Gemini Pro). The cheap models are surprisingly capable on simple tasks. They’re catastrophically inadequate on hard reasoning.
Most production traffic is simple. Classification, extraction, FAQ-style answering, summarisation. The cheap model is fine 70-90% of the time. Routing means using it when it works and the expensive model only when needed.
Classifier-based routing
The robust approach: train a small classifier (or use a small LLM) to predict request difficulty. Easy goes to cheap; hard goes to expensive.
Features that predict difficulty:
- Prompt length and complexity.
- Presence of multi-step reasoning markers (“explain”, “analyse”, “why”).
- Domain (code > text, math > FAQ).
- Past traffic: similar requests that failed on the cheap model → route expensive.
The classifier doesn’t need to be sophisticated. Logistic regression on prompt embeddings + a few hand-crafted features hits 90%+ routing accuracy on most workloads.
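As a sketch of that idea, here is a tiny logistic-regression router trained by gradient descent in numpy. The embedding features are omitted for brevity; the hand-crafted features, marker list, and hyperparameters are illustrative, not a recommendation:

```python
import numpy as np

REASONING_MARKERS = ("why", "explain", "analyse", "prove", "derive")

def features(prompt: str) -> np.ndarray:
    """Hand-crafted difficulty features (prompt embeddings omitted here)."""
    tokens = prompt.split()
    return np.array([
        min(len(tokens) / 500, 1.0),                               # normalised length
        float("```" in prompt),                                    # contains a code block
        float(any(m in prompt.lower() for m in REASONING_MARKERS)),
        float("json" in prompt.lower()),                           # extraction-style request
        1.0,                                                       # bias term
    ])

def train_router(prompts, labels, lr=0.5, steps=2000):
    """Logistic regression via gradient descent; label 1 = needs expensive model."""
    X = np.stack([features(p) for p in prompts])
    y = np.array(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))       # predicted probability of "hard"
        w -= lr * X.T @ (p - y) / len(y)   # averaged gradient step
    return w

def route(prompt, w, threshold=0.5):
    p = 1 / (1 + np.exp(-features(prompt) @ w))
    return "expensive" if p > threshold else "cheap"
```

In production you would replace the toy feature vector with prompt embeddings concatenated with these hand-crafted signals, and fit on your labelled eval set rather than a handful of examples.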
Heuristic shortcuts
Before building a classifier, try heuristics. They handle the 80% case for free:
- Short prompts (< 200 tokens) without code: cheap model.
- Prompts with code blocks: expensive model.
- Prompts asking “why” or “explain”: expensive.
- Extraction tasks (“return JSON with X, Y, Z”): cheap.
- Multi-turn chat with > 5 prior turns: expensive (context complexity).
These rules don’t need ML. They cover most of the win and let you measure routing impact before investing in classifier training.
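The rules above fit in one plain function. Token counts are approximated by whitespace splitting and the keyword checks are deliberately crude; every threshold here is illustrative:

```python
def route_heuristic(prompt: str, prior_turns: int = 0) -> str:
    """Rule-based routing. Returns "cheap" or "expensive"."""
    tokens = len(prompt.split())  # rough token count
    text = prompt.lower()
    if prior_turns > 5:
        return "expensive"        # long multi-turn context
    if "```" in prompt:
        return "expensive"        # contains a code block
    if "why" in text.split() or "explain" in text:
        return "expensive"        # multi-step reasoning markers
    if "return json" in text:
        return "cheap"            # extraction task
    if tokens < 200:
        return "cheap"            # short prompt, no code
    return "expensive"            # default to the safe tier
```

Rule order matters: the expensive-tier rules fire first so that a short prompt containing code still escalates.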
The routing eval set
You can’t route well without measuring. Build an eval set of (input, ideal-model-tier, ideal-output) triples. For each request:
- Run the cheap model and measure quality.
- Run the expensive model and measure quality.
- If the cheap model's quality is ≥ your threshold, label the request cheap-tier.
- Otherwise, label it expensive-tier.
Once you have this labelled, train (or write rules for) the router. The eval lets you measure routing quality on new data and catch regressions when you change the router.
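The labelling loop can be sketched as below. The `run_cheap`, `run_expensive`, and `score` callables are placeholders for your model calls and quality metric (a rubric grader, an LLM judge, whatever your eval uses); the threshold is illustrative:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalRecord:
    input: str
    ideal_tier: str    # "cheap" or "expensive"
    ideal_output: str

def label_requests(requests: List[str],
                   run_cheap: Callable[[str], str],
                   run_expensive: Callable[[str], str],
                   score: Callable[[str, str], float],
                   threshold: float = 0.8) -> List[EvalRecord]:
    """Label each request with the cheapest tier that meets the quality bar."""
    records = []
    for req in requests:
        cheap_out = run_cheap(req)
        if score(req, cheap_out) >= threshold:
            records.append(EvalRecord(req, "cheap", cheap_out))
        else:
            # Cheap model failed the bar: record the expensive tier and output.
            records.append(EvalRecord(req, "expensive", run_expensive(req)))
    return records
```

The resulting records are both the router's training data (input → ideal tier) and its regression test set.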
Cascade pattern
An aggressive variant: try cheap first, evaluate the response with a cheap classifier, escalate to expensive if quality is poor.
The cascade pays the cheap model's cost plus a small evaluation cost on every query. On hard queries, you also pay the expensive model, so the cheap call is wasted. Net cost sits below sending everything to the expensive model but above optimal direct routing.
Use cascade when you can’t pre-classify difficulty cheaply. Use direct routing when you can. In practice, prompts that look easy are almost always easy. Pre-classification works for 90% of traffic without cascade overhead.
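The cascade reduces to a few lines once the pieces exist. Here `judge` stands in for the cheap quality classifier and the threshold is illustrative:

```python
from typing import Callable, Tuple

def cascade(prompt: str,
            run_cheap: Callable[[str], str],
            run_expensive: Callable[[str], str],
            judge: Callable[[str, str], float],
            threshold: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; escalate if the judge scores the draft poorly.

    Returns (output, tier_used) so cost per tier can be tracked.
    """
    draft = run_cheap(prompt)
    if judge(prompt, draft) >= threshold:
        return draft, "cheap"
    return run_expensive(prompt), "expensive"
```

Logging the returned tier alongside the judge score is what lets you compare cascade cost against the direct-routing alternative on real traffic.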