By Samson Tanimawo, PhD · Published Sep 2, 2025

LLM Routing: Haiku for Cheap, Opus for Hard

Sending every request to your most capable model is the most expensive mistake in LLM engineering. A simple router cuts cost 60-80% without measurable quality loss.

Why route at all

Models scale 10-30x in cost from cheap (Haiku, GPT-4o-mini, Gemini Flash) to flagship (Opus, GPT-4o, Gemini Pro). The cheap models are surprisingly capable on simple tasks. They’re catastrophically inadequate on hard reasoning.

Most production traffic is simple: classification, extraction, FAQ-style answering, summarisation. The cheap model is fine 70-90% of the time. Routing means using it when it works and escalating to the expensive model only when needed.

Classifier-based routing

The robust approach: train a small classifier (or use a small LLM) to predict request difficulty. Easy goes to cheap; hard goes to expensive.

Features that predict difficulty are cheap to compute: prompt length, the presence of code blocks or multi-step instructions, keywords that signal reasoning-heavy requests, and the prompt's embedding.

The classifier doesn’t need to be sophisticated. Logistic regression on prompt embeddings + a few hand-crafted features hits 90%+ routing accuracy on most workloads.
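As a minimal sketch of that idea, here is a logistic-regression router. For self-containedness it uses hand-crafted features only (in a real system you would concatenate prompt embeddings); the marker keywords, training examples, and tier names are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical keywords that tend to signal reasoning-heavy requests.
HARD_MARKERS = ("prove", "debug", "why", "design", "step by step")

def features(prompt: str) -> list[float]:
    p = prompt.lower()
    return [
        len(p) / 1000,                                # longer prompts skew harder
        float(any(m in p for m in HARD_MARKERS)),     # reasoning keywords
        float("```" in prompt),                       # code blocks suggest harder tasks
        float(p.count("?")),                          # multi-question prompts
    ]

# Toy labelled data: 0 = easy (cheap model), 1 = hard (expensive model).
prompts = [
    ("Summarise this paragraph in one sentence.", 0),
    ("Extract the dates from this email.", 0),
    ("What is the capital of France?", 0),
    ("Classify this review as positive or negative.", 0),
    ("Prove that the algorithm terminates on all inputs.", 1),
    ("Why does this design fail under load? Walk through it step by step.", 1),
    ("Design a sharded queue and prove it is starvation-free.", 1),
    ("Debug this race condition in the worker pool.", 1),
]
X = np.array([features(p) for p, _ in prompts])
y = np.array([label for _, label in prompts])

clf = LogisticRegression().fit(X, y)

def route(prompt: str) -> str:
    """Return the tier name to send this prompt to."""
    return "opus" if clf.predict([features(prompt)])[0] else "haiku"
```

With realistic embeddings in the feature vector, the same two-line routing function stays unchanged; only `features` grows.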

Heuristic shortcuts

Before building a classifier, try heuristics. They handle the 80% case for free: route by endpoint (classification and extraction endpoints never need the flagship), send short code-free prompts to the cheap model, and send anything with code blocks or explicit reasoning requests straight to the expensive one.

These rules don’t need ML. They cover most of the win and let you measure routing impact before investing in classifier training.
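The rules above can be sketched in a few lines. The endpoint names, length threshold, and keyword list below are illustrative assumptions to be tuned against your own traffic, not recommended values.

```python
def heuristic_route(prompt: str, endpoint: str = "chat") -> str:
    """Rule-based router: no ML, just cheap signals."""
    # Known-simple endpoints never need the flagship model.
    if endpoint in {"classify", "extract", "faq"}:
        return "haiku"
    # Code-heavy or explicitly reasoning-heavy requests go straight up.
    hard_words = ("prove", "derive", "refactor", "debug")
    if "```" in prompt or any(w in prompt.lower() for w in hard_words):
        return "opus"
    # Short, code-free prompts are usually lookups or rewrites.
    if len(prompt) < 200:
        return "haiku"
    return "haiku"  # default cheap; the eval set catches misroutes
```

Defaulting to the cheap tier is a deliberate choice: misroutes show up as quality complaints you can measure, while defaulting expensive silently burns the budget.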

The routing eval set

You can’t route well without measuring. Build an eval of (input, ideal-model-tier, ideal-output) triples. For each request, run both the cheap and expensive models, grade both outputs, and record the cheapest tier whose output was acceptable as the ideal tier.

Once you have this labelled, train (or write rules for) the router. The eval lets you measure routing quality on new data and catch regressions when you change the router.
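A sketch of that labelling loop and the regression check, assuming stand-in callables for your model clients and your grader (both are hypothetical interfaces, not a real API):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    ideal_tier: str    # cheapest tier whose output was acceptable
    ideal_output: str

def label_case(prompt: str, models: dict, judge) -> EvalCase:
    """Run tiers cheapest-first; the first acceptable answer sets the label.
    `models` maps tier name -> callable; `judge(prompt, answer)` returns bool."""
    for tier in ("haiku", "opus"):                 # cheapest first
        answer = models[tier](prompt)
        if judge(prompt, answer):
            return EvalCase(prompt, tier, answer)
    return EvalCase(prompt, "opus", answer)        # nothing passed: keep best effort

def routing_accuracy(router, cases: list[EvalCase]) -> float:
    """Fraction of eval cases where the router picks the labelled tier."""
    return sum(router(c.input) == c.ideal_tier for c in cases) / len(cases)
```

Rerunning `routing_accuracy` on the same labelled set after every router change is the regression check the section describes.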

Cascade pattern

An aggressive variant: try cheap first, evaluate the response with a cheap classifier, escalate to expensive if quality is poor.
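The cascade is small enough to show whole. The model callables and the quality check are stand-ins; in practice `good_enough` would be a small classifier or rubric grader.

```python
def cascade(prompt: str, cheap, expensive, good_enough) -> tuple[str, str]:
    """Try the cheap model first; escalate only if the check rejects it."""
    answer = cheap(prompt)
    if good_enough(prompt, answer):
        return "haiku", answer         # paid: cheap call + tiny eval cost
    return "opus", expensive(prompt)   # paid: cheap (wasted) + expensive
```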

The cascade pays the cheap model’s cost on every easy query and a small evaluation cost. On hard queries, you pay both cheap (wasted) and expensive. Net cost is below sending everything to expensive but above optimal routing.
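That cost ordering can be made concrete with per-request expected cost under each policy. The prices and easy-traffic fraction below are illustrative placeholders, and the model assumes a perfect quality check.

```python
def expected_cost(p_easy: float, c_cheap: float, c_exp: float, c_eval: float):
    """Expected per-request cost of three policies."""
    all_expensive = c_exp
    direct_routing = p_easy * c_cheap + (1 - p_easy) * c_exp
    cascade = c_cheap + c_eval + (1 - p_easy) * c_exp  # cheap call always paid
    return all_expensive, direct_routing, cascade

# e.g. 80% easy traffic, flagship 20x the cheap model's price:
a, r, c = expected_cost(p_easy=0.8, c_cheap=1.0, c_exp=20.0, c_eval=0.1)
# a = 20.0, r = 4.8, c = 5.1  ->  routing < cascade < everything-expensive
```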

Use cascade when you can’t pre-classify difficulty cheaply. Use direct routing when you can. In practice, prompts that look easy are almost always easy. Pre-classification works for 90% of traffic without cascade overhead.