Test-Time Compute and Iterative Reasoning
Spending more compute at inference time, rather than training time, is the new scaling lever. Models that "think longer" on hard problems can outperform much larger models that answer immediately.
The shift
Pre-2024, capability scaled primarily with training compute. o1-style reasoning models showed that capability also scales with inference compute, and the returns can be comparable: spend 100x more reasoning tokens per query, and accuracy on hard problems jumps non-linearly.
Techniques
- Extended reasoning: the model emits long internal chains of thought before answering.
- Self-consistency: generate K answers, return the most common.
- Search: tree-of-thoughts, beam search over reasoning steps, MCTS.
- Verification: generate, check, correct, repeat.
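Self-consistency is the simplest of these to sketch. A minimal version, assuming a `generate` callable that samples one answer per call (e.g. a wrapper around a temperature > 0 model call; the function names here are illustrative, not any particular API):

```python
from collections import Counter

def self_consistency(generate, question, k=5):
    """Sample k independent answers and return the plurality winner.

    `generate` is a hypothetical stand-in for one sampled model answer.
    Returns the most common answer and its agreement rate among the k samples.
    """
    answers = [generate(question) for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / k

# Deterministic toy "model" for illustration: a fixed pool of sampled answers.
pool = iter(["42", "41", "42", "42", "13"])
best, agreement = self_consistency(lambda q: next(pool), "6 * 7?", k=5)
# best → "42", agreement → 0.6
```

The key design point is that agreement itself is a useful signal: a low agreement rate flags queries worth escalating to more expensive techniques like search or verification.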
The economics
Test-time compute is dial-able. Easy queries get minimal thinking and stay cheap; hard queries get extended thinking, expensive but accurate. This is fundamentally more economical than always using the biggest model, and frontier labs are organising their products around this dial.
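The dial can be sketched as a simple router that maps estimated difficulty to a thinking-token budget. Every name here is hypothetical: `estimate_difficulty` is assumed to return a score in [0, 1], and `solve` stands in for a reasoning-model call that accepts a token budget.

```python
def answer(question, estimate_difficulty, solve, budgets=(256, 2048, 16384)):
    """Route a query to a thinking-token budget based on estimated difficulty.

    Illustrative sketch: `estimate_difficulty(question)` -> float in [0, 1],
    `solve(question, max_thinking_tokens=...)` -> the model's answer.
    """
    d = estimate_difficulty(question)
    if d < 0.3:
        budget = budgets[0]   # easy: minimal thinking, cheap
    elif d < 0.7:
        budget = budgets[1]   # medium: moderate thinking
    else:
        budget = budgets[2]   # hard: extended thinking, expensive but accurate
    return solve(question, max_thinking_tokens=budget)

# Illustrative usage with stand-in functions that just echo the budget chosen:
hard = answer("prove this theorem", lambda q: 0.9,
              lambda q, max_thinking_tokens: max_thinking_tokens)
easy = answer("2 + 2?", lambda q: 0.1,
              lambda q, max_thinking_tokens: max_thinking_tokens)
# hard → 16384, easy → 256
```

In practice the difficulty estimate might itself be a cheap model pass, so the router's cost stays negligible next to the savings on easy queries.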