Speculative Decoding: How Models Hit 1000 tok/sec
A small “draft” model proposes several tokens; the big model verifies them in parallel. The big model still produces every token but does much less sequential work.
The core idea
Standard autoregressive decoding generates one token at a time. Each token requires a forward pass through the entire model. Sequential. Slow at scale.
Speculative decoding adds a small “draft” model that proposes K tokens at a time. The big “target” model verifies all K in parallel in a single forward pass. Tokens that match the target’s predictions are accepted; tokens that don’t are corrected and the process restarts.
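The draft-then-verify loop can be sketched in a few lines. This is a minimal illustration, not a production implementation: `draft_model` and `target_model` are hypothetical callables returning a next token, and greedy acceptance (exact token match) is shown for simplicity.

```python
def speculative_step(prompt, draft_model, target_model, k=4):
    # 1. Draft: the small model proposes k tokens autoregressively.
    drafted = []
    ctx = list(prompt)
    for _ in range(k):
        tok = draft_model(ctx)           # cheap forward pass
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify: the target scores the drafted positions.
    #    (Shown as a loop for clarity; real systems score all k
    #    positions in one batched forward pass.)
    accepted = list(prompt)
    for tok in drafted:
        target_tok = target_model(accepted)
        if target_tok == tok:
            accepted.append(tok)         # draft matched: free token
        else:
            accepted.append(target_tok)  # mismatch: take the target's token
            break
    else:
        # All k drafts accepted: the target's verification pass also
        # yields one bonus token.
        accepted.append(target_model(accepted))
    return accepted[len(prompt):]
```

Note the asymmetry: a mismatch stops acceptance at that position, because everything after it was drafted from a now-invalid prefix.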
The math: if the target accepts on average M of the K draft tokens, you generate M+1 tokens per target forward pass instead of 1: the M accepted drafts plus one token the target emits itself (a correction when a draft is rejected, or a free bonus token when all K are accepted).
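A quick sanity check of that arithmetic, under the simplifying assumption that each draft token is accepted independently with probability p (real acceptance rates vary by position and content):

```python
def tokens_per_pass(p: float, k: int) -> float:
    # The number of accepted drafts follows a truncated geometric
    # distribution, and the target always emits one extra token
    # (correction or bonus), so the expected tokens per target
    # forward pass is the sum of p**i for i = 0..k.
    return sum(p ** i for i in range(k + 1))

# With a 60% per-token acceptance rate and k = 4 drafts, each target
# pass yields roughly 2.3 tokens instead of 1.
```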
How verification works
Crucially, this is exact, not approximate. The target model still produces every token according to its own probability distribution. The draft is just a guess that’s either accepted (free speedup) or corrected (no penalty beyond the wasted draft step).
The trick: the target processes all K draft tokens in a single batched forward pass, at roughly the compute cost of one regular decoding step. The net speedup depends on the acceptance rate, the draft length K, and how cheap drafting is relative to the target: higher acceptance means more tokens per target pass, with zero quality change.
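Exactness comes from the rejection-sampling rule used in the original speculative sampling papers: accept each draft token with probability min(1, p/q), and on rejection resample from the renormalized residual of the two distributions. A minimal sketch, assuming the draft probabilities `q_probs` and target probabilities `p_probs` for each position are already computed (the function name and argument layout are illustrative):

```python
import random

def verify(draft_tokens, q_probs, p_probs, vocab_size):
    """Accept/reject draft tokens so the output exactly matches the
    target distribution. q_probs[i][t] is the draft model's probability
    of token t at position i; p_probs[i][t] is the target's, obtained
    from one batched target forward pass."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept with probability min(1, p/q): exact rejection sampling.
        if random.random() < min(1.0, p_probs[i][tok] / q_probs[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from max(0, p - q), renormalized.
            # This residual exactly compensates for the bias of the
            # acceptance step, so the output distribution is p.
            residual = [max(0.0, p_probs[i][t] - q_probs[i][t])
                        for t in range(vocab_size)]
            accepted.append(
                random.choices(range(vocab_size), weights=residual)[0])
            break
    return accepted
```

When the draft and target distributions agree exactly, every draft token is accepted; the worse the draft, the earlier rejection tends to occur.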
Choosing a draft model
Three options:
- Smaller model from the same family (e.g., Llama-7B drafting for Llama-70B). Easy alignment of distributions, modest speedup (1.5-2x).
- EAGLE/Medusa heads: lightweight prediction heads added to the target itself. Higher draft accuracy, lower compute. State-of-the-art for 2024-2025.
- N-gram drafting: classical statistical drafter. Free, surprisingly effective on repetitive content (code, structured output).
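N-gram drafting can be as simple as prompt lookup: if the current suffix appeared earlier in the context, replay whatever followed it as the draft. A toy sketch (the function name and parameters are illustrative):

```python
def ngram_draft(tokens, k=4, n=2):
    """Propose up to k draft tokens by finding the most recent earlier
    occurrence of the last n tokens and copying its continuation.
    Works well on repetitive content like code and structured output."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for an earlier occurrence of the current suffix.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + k]
    return []  # no match: fall back to regular decoding
```

Because the drafter is a lookup, it costs essentially nothing, and mismatches only waste the (free) draft, never quality.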
EAGLE-style drafting on a 70B target hits 3-4x speedup on most chat workloads. It’s the dominant technique in modern serving.
Real-world gains
Production deployments report:
- vLLM with speculative decoding: 2-3x throughput on Llama-class models.
- Cerebras and Groq: hardware-level parallelism plus speculation reaches 1000+ tok/sec on 70B-class models.
- Open-weight inference servers: 30-50% real-world speedup with default speculator settings.
The gains are larger on predictable-content tasks (code completion, structured generation) and smaller on creative tasks where the draft frequently disagrees.
What comes after
Speculative decoding is one of several techniques converging on the same goal: more output tokens per unit of GPU work. Related lines:
- Look-ahead decoding (no draft model needed; uses N-gram caches).
- Speculative sampling extensions for tree-of-tokens (multiple branches per step).
- Hardware-level acceleration via on-die SRAM (Cerebras Wafer-Scale, Groq LPU).
The throughput frontier of LLM inference is being pushed faster by serving infrastructure than by model architecture changes. Expect serving costs to keep falling 2-3x per year for the same model quality.