AI & ML · Advanced · By Samson Tanimawo, PhD · Published Oct 28, 2025 · 7 min read

Speculative Decoding: How Models Hit 1000 tok/sec

A small “draft” model proposes several tokens; the big model verifies them in parallel. The big model still produces every token but does much less sequential work.

The core idea

Standard autoregressive decoding generates one token at a time. Each token requires a forward pass through the entire model. Sequential. Slow at scale.

Speculative decoding adds a small “draft” model that proposes K tokens at a time. The big “target” model verifies all K in parallel in a single forward pass. Drafted tokens that match the target’s predictions are accepted; at the first mismatch, the target’s own token is emitted instead, the remaining drafts are discarded, and drafting resumes from that point.
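The draft-then-verify loop can be sketched end to end with toy stand-in models. The functions below are illustrative assumptions (plain functions playing the role of real networks), not a serving implementation; the greedy accept-or-correct logic is the part that mirrors the technique.

```python
# Toy stand-ins for the two models: predict-the-next-integer, where the
# cheap draft makes a mistake on every multiple of 4.

def draft_next(ctx):
    n = ctx[-1] + 1
    return n + 1 if n % 4 == 0 else n   # deliberately wrong sometimes

def target_next(ctx):
    return ctx[-1] + 1                  # the "ground truth" target

def speculative_step(ctx, k=4):
    """One draft-then-verify round; returns the tokens emitted."""
    # 1. Draft proposes k tokens autoregressively (k cheap passes).
    proposed, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(d_ctx)
        proposed.append(t)
        d_ctx.append(t)
    # 2. Target checks every position. (Simulated sequentially here;
    #    a real target scores all k prefixes in ONE batched forward pass.)
    emitted, v_ctx = [], list(ctx)
    for t in proposed:
        correct = target_next(v_ctx)
        if t == correct:                # accepted: draft matched
            emitted.append(t)
            v_ctx.append(t)
        else:                           # rejected: emit target's token, stop
            emitted.append(correct)
            return emitted
    # 3. All k accepted: the same target pass yields one bonus token.
    emitted.append(target_next(v_ctx))
    return emitted

ctx, out = [0], []
while len(out) < 10:
    step = speculative_step(ctx)
    out.extend(step)
    ctx.extend(step)
print(out[:10])  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Even though the draft is wrong at every multiple of 4, the output is exactly the target's greedy continuation: mistakes cost speed, never correctness.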

The math: if the target accepts an average of M of the K drafted tokens, each target forward pass yields M+1 tokens instead of 1 (the +1 is the target’s own next token, which the same verification pass produces for free).
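Under the simplifying assumption (mine, not the article's) that each draft token is accepted independently with probability alpha, and that acceptance stops at the first rejection, the expectation works out to a short geometric sum:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha. A rejection still yields the target's corrected
    token, and full acceptance yields a bonus token, so the expectation
    is sum_{i=0}^{k} alpha**i.
    """
    return sum(alpha**i for i in range(k + 1))

print(expected_tokens_per_pass(0.6, 4))  # ≈ 2.31 tokens per target pass
print(expected_tokens_per_pass(1.0, 4))  # perfect draft: k + 1 = 5
```

Note the diminishing returns: past the first few positions, extra draft tokens are rarely all accepted, which is why K is typically small (4-8) in practice.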

How verification works

Crucially, this is exact, not approximate. The target model still produces every token according to its own probability distribution. The draft is just a guess that’s either accepted (free speedup) or corrected (no penalty beyond the wasted draft step).

The trick: the target processes all K draft tokens in a single batched forward pass, at roughly the compute cost of one regular step. With a per-token acceptance rate of 60% and K=4, you average more than two tokens per target pass: over 2x throughput with zero quality change.
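For sampling (rather than greedy) decoding, exactness comes from a modified rejection-sampling rule from the speculative sampling literature: accept a drafted token with probability min(1, p_target/p_draft), and on rejection resample from the normalized residual max(0, p_target − p_draft). A minimal sketch with toy fixed distributions (the distributions and trial count are illustrative assumptions):

```python
import random

random.seed(0)

def verify_token(p_target, p_draft, token, rng=random):
    """Accept or reject one drafted token so the emitted token is an
    exact sample from p_target. p_target/p_draft are probability lists
    over the vocabulary; token is the id the draft sampled."""
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token  # accepted: counts as a sample from p_target
    # Rejected: resample from the residual max(0, p - q); random.choices
    # normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    return rng.choices(range(len(p_target)), weights=residual)[0]

# Empirical check: drafted tokens come from a *different* distribution,
# yet the emitted tokens match the target distribution.
p_target = [0.7, 0.2, 0.1]
p_draft = [0.4, 0.4, 0.2]
counts = [0, 0, 0]
for _ in range(100_000):
    tok = random.choices([0, 1, 2], weights=p_draft)[0]
    counts[verify_token(p_target, p_draft, tok)] += 1
freqs = [c / 100_000 for c in counts]
print(freqs)  # close to p_target = [0.7, 0.2, 0.1]
```

This is why the speedup is "free": no matter how bad the draft is, accept-plus-residual-resample provably reproduces the target's distribution exactly; a bad draft only lowers the acceptance rate.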

Choosing a draft model

Three options:

- A smaller pretrained model from the same family (e.g., a 1B model drafting for a 70B target). Simple, but the draft needs its own forward passes and GPU memory.
- N-gram / prompt-lookup drafting: propose the tokens that followed a matching n-gram earlier in the context. No extra model at all; strongest on code and retrieval-heavy prompts.
- A trained draft head attached to the target itself (Medusa, EAGLE): reuses the target’s hidden states, so drafting is nearly free and acceptance rates are high.

EAGLE-style drafting on a 70B target hits 3-4x speedup on most chat workloads. It’s the dominant technique in modern serving.
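The cheapest drafting scheme needs no model at all: n-gram / prompt-lookup drafting proposes whatever tokens followed the current trailing n-gram the last time it appeared in the context. A minimal sketch (the function name and example tokens are illustrative assumptions):

```python
def prompt_lookup_draft(tokens, ngram=2, k=4):
    """Return up to k draft tokens by matching the trailing n-gram
    against earlier occurrences in the token sequence."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Scan backwards for the most recent earlier occurrence of the tail.
    for i in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[i:i + ngram] == tail:
            return tokens[i + ngram:i + ngram + k]
    return []  # no match: fall back to plain decoding this step

# Repetitive contexts (code, structured output) draft well:
ctx = "for i in range ( n ) : print ( i ) for i in".split()
print(prompt_lookup_draft(ctx, ngram=2, k=4))  # → ['range', '(', 'n', ')']
```

The drafts are often wrong on prose, but since rejected drafts cost almost nothing and this scheme costs no extra model memory or forward passes, it is a common default for code and RAG workloads.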

Real-world gains

Production deployments report the largest gains on predictable-content tasks (code completion, structured generation) and smaller gains on creative tasks, where the draft frequently disagrees with the target.

What comes after

Speculative decoding is one of several techniques converging on the same goal: more output tokens per unit of GPU work. Related lines include multi-token prediction heads trained into the target itself, lookahead and Jacobi-style parallel decoding, and steadily tighter batching and scheduling in serving engines.

The throughput frontier of LLM inference is being pushed faster by serving infrastructure than by model architecture changes. Expect serving costs to keep falling 2-3x per year for the same model quality.