Reasoning Models: o1-Style Architecture
OpenAI’s o1 introduced ‘thinking models’ that produce a long internal reasoning chain before answering. The pattern has spread. Here is what makes them different.
The core idea
Reasoning models (OpenAI o1, o3; Claude with extended thinking; DeepSeek R1; Gemini Thinking) spend extra compute at inference time to think through hard problems. They still generate text token by token, but before committing to a final answer they produce an intermediate chain of thought, sometimes thousands of tokens of internal deliberation.
The shift from training to inference. Traditional progress: bigger model, more training compute, smarter model. Reasoning models added a new axis: same model, more INFERENCE compute, smarter answers. Test-time compute scaling is the new frontier of capability.
The visible vs hidden split. Some models show the reasoning to the user (Claude with extended thinking). Others hide it (o1, o3): the user sees the answer, not the deliberation. The hidden version is the more common production pattern; users don't usually want to read 5,000 tokens of deliberation.
The breakthrough moment. The 2024 announcement of o1 marked the moment reasoning models went mainstream. Performance on math, coding, and science benchmarks jumped substantially. The community recognised this as a different scaling regime, not just incremental progress.
Training reasoning models
Reasoning models are post-trained with reinforcement learning to produce useful chain-of-thought. The training rewards reasoning that arrives at correct answers; the model learns to "think out loud" in productive ways. Models without this post-training can be prompted to do CoT but rarely match the quality of explicitly-trained reasoning models.
The RL setup. Generate many candidate reasoning traces; score by whether they reach correct answers; reward the high-scoring traces. Over training, the model learns reasoning patterns that work. Verifiable problems (math, coding) provide the cleanest reward signal; generic tasks are harder to RL-train this way.
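The loop described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual training recipe: `sample_trace` is a hypothetical stand-in for sampling from the policy model, the verifier is a simple exact-match check, and the "reward the high-scoring traces" step is shown as a mean-baseline advantage, which is one common policy-gradient choice assumed here for illustration.

```python
import random
from statistics import mean

def verifier(answer: int, expected: int) -> float:
    """Binary reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer == expected else 0.0

def sample_trace(rng: random.Random, expected: int) -> tuple[str, int]:
    """Hypothetical stand-in for sampling one reasoning trace from the policy.

    A real system would call the model; here roughly 40% of traces are correct.
    """
    answer = expected if rng.random() < 0.4 else expected + rng.randint(1, 3)
    return f"...reasoning ending in {answer}", answer

def score_group(expected: int, n: int = 8, seed: int = 0):
    """Sample n traces for one problem, score them, and compute advantages."""
    rng = random.Random(seed)
    traces = [sample_trace(rng, expected) for _ in range(n)]
    rewards = [verifier(answer, expected) for _, answer in traces]
    # Advantage = reward minus the group mean: traces that beat the group
    # average get a positive training signal, the rest a negative one.
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    return traces, rewards, advantages

traces, rewards, advantages = score_group(expected=42)
```

In a real pipeline the advantages would weight a gradient update on the tokens of each trace; that step is omitted here.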
The CoT prompting baseline. Even before reasoning-specific training, prompting "think step by step" elicits some chain-of-thought. The gap between prompted CoT and trained reasoning is substantial: trained models produce more useful, longer, better-structured reasoning.
The reward-hacking risk. Models trained with verifier-based RL learn to produce reasoning the verifier likes. Sometimes what the verifier likes correlates with what is actually correct; sometimes the model finds spurious patterns that get rewarded anyway. Detection: hold-out evaluation on verifiers the training process didn't use.
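One way to make the hold-out check concrete is to compare pass rates on the training verifier with pass rates on an independent verifier over the same outputs. The sketch below assumes you already have per-sample pass/fail results from both; the function names and the 10-point threshold are illustrative assumptions, not established defaults.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of samples a verifier accepted."""
    return sum(results) / len(results) if results else 0.0

def reward_hacking_gap(train_results: list[bool],
                       holdout_results: list[bool]) -> float:
    """Pass rate on the training verifier minus pass rate on the held-out one."""
    return pass_rate(train_results) - pass_rate(holdout_results)

def flag_reward_hacking(train_results: list[bool],
                        holdout_results: list[bool],
                        threshold: float = 0.10) -> bool:
    """Flag when the training verifier is much more generous than the hold-out.

    The threshold is an assumed value; tune it against your own noise level.
    """
    return reward_hacking_gap(train_results, holdout_results) > threshold

# Example: 90% accepted by the training verifier, 60% by the hold-out.
train = [True] * 9 + [False]
holdout = [True] * 6 + [False] * 4
suspicious = flag_reward_hacking(train, holdout)
```

A large positive gap does not prove reward hacking, but it is a cheap signal that the model may be optimising for quirks of the training verifier.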
The compute cost. RL training on reasoning is expensive: every problem needs many candidate traces, each of which must be scored before it can be used as training signal. Frontier reasoning model training represents a significant fraction of the lab's compute budget. The cost is justified by the capability gains.
Test-time compute
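The new dimension of capability. Same model + more inference compute = better answers. The compute-quality curve keeps rising after further pre-training yields diminishing returns. For hard problems, published results suggest accuracy climbs roughly linearly with the logarithm of test-time compute: each doubling buys a similar increment in accuracy rather than doubling the chance of a correct answer. The gains are still large enough to be genuinely useful.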
The mechanics. The model generates more reasoning tokens before final answer. More tokens = more deliberation = more compute = better quality. The user sets a budget (low/medium/high in some APIs); the model uses up to that much compute.
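The budget mechanic can be illustrated with a toy generation loop. Everything here is hypothetical: the effort-to-token mapping uses made-up numbers, and `step_fn` stands in for the model emitting one reasoning step at a time. Real APIs enforce budgets server-side; this only shows the shape of the idea.

```python
from typing import Callable, Optional

# Hypothetical mapping from effort level to a reasoning-token cap.
EFFORT_BUDGETS = {"low": 1_000, "medium": 8_000, "high": 32_000}

# A step is (text, token_count); None means the model has finished reasoning.
Step = Optional[tuple[str, int]]

def reason_with_budget(step_fn: Callable[[int], Step], effort: str):
    """Collect reasoning steps until the model stops or the budget runs out."""
    budget = EFFORT_BUDGETS[effort]
    used = 0
    steps: list[str] = []
    while True:
        step = step_fn(len(steps))
        if step is None:
            break  # the model decided it had reasoned enough
        text, tokens = step
        if used + tokens > budget:
            break  # the next step would exceed the cap, so stop here
        steps.append(text)
        used += tokens
    return steps, used

# A toy "model" that always wants another 300-token step.
steps, used = reason_with_budget(lambda i: (f"step {i}", 300), "low")
```

With the `low` budget the toy model is cut off after three steps; the same model at `high` would keep going for far longer, which is where the extra cost and latency come from.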
The scaling law observation. On some tasks, test-time compute scaling produces capability gains comparable to those from similar-magnitude training-compute increases. For a fixed task, you can often substitute test-time compute for training compute and reach similar quality, though the exchange rate varies with the task and its difficulty. The substitution opens new design space.
The cost-per-task math. A reasoning model used at high compute might use 10-100x the tokens of a non-reasoning model on the same task. Cost is correspondingly higher. For tasks where quality matters more than cost, the math works; for high-volume tasks, the cost becomes prohibitive.
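A back-of-the-envelope version of this arithmetic is sketched below. The prices and token counts are invented for illustration, and the sketch assumes reasoning tokens are billed at the output-token rate; check your provider's pricing page before relying on that.

```python
def cost_per_task(input_tokens: int,
                  output_tokens: int,
                  reasoning_tokens: int,
                  price_in_per_m: float,
                  price_out_per_m: float) -> float:
    """Cost of one task in dollars, given prices per million tokens.

    Assumption: reasoning tokens are billed like output tokens.
    """
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens * price_in_per_m
            + billed_output * price_out_per_m) / 1_000_000

# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
fast = cost_per_task(1_000, 500, 0, 1.0, 4.0)            # no reasoning tokens
reasoning = cost_per_task(1_000, 500, 20_000, 1.0, 4.0)  # 20k reasoning tokens

ratio = reasoning / fast
monthly_fast = fast * 1_000_000            # one million tasks per month
monthly_reasoning = reasoning * 1_000_000
```

With these made-up numbers the reasoning call costs about 28x the fast call ($0.083 vs $0.003 per task), which is tolerable for a few thousand high-value tasks and painful at a million tasks a month.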
The latency reality. High-compute reasoning takes 30-300 seconds per response. Interactive UX doesn't fit; async UX does. Production reasoning use cases batch or background the work; users see "thinking..." and get the answer when it's ready.
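The background-job pattern can be sketched with `asyncio` from the standard library. `slow_reasoning_call` is a hypothetical placeholder for a real model call (the delay is shortened so the example runs quickly), and the in-memory `jobs` dict stands in for whatever job store a production system would use.

```python
import asyncio

async def slow_reasoning_call(prompt: str, delay_s: float) -> str:
    """Hypothetical stand-in for a long-running reasoning request."""
    await asyncio.sleep(delay_s)
    return f"answer to: {prompt}"

def submit(jobs: dict, job_id: str, prompt: str, delay_s: float = 0.05) -> None:
    """Start the call in the background and return immediately.

    Must be called from inside a running event loop.
    """
    jobs[job_id] = asyncio.create_task(slow_reasoning_call(prompt, delay_s))

def status(jobs: dict, job_id: str) -> str:
    """What the UI would show for this job right now."""
    return "done" if jobs[job_id].done() else "thinking..."

async def demo():
    jobs: dict = {}
    submit(jobs, "job-1", "prove X")
    before = status(jobs, "job-1")   # the task has not finished yet
    answer = await jobs["job-1"]     # later: collect the result
    after = status(jobs, "job-1")
    return before, after, answer

before, after, answer = asyncio.run(demo())
```

The point of the design is that `submit` never blocks: the caller gets a job id back at once and polls or awaits the result later.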
Gains and where they apply
Reasoning models excel at problems with verifiable structure. Math word problems, competitive programming, scientific reasoning, logic puzzles all see substantial gains. They help less on tasks where reasoning isn't the bottleneck: open-ended writing, summarisation, casual Q&A.
The math gains. Reasoning models score roughly 80-95% on competition math benchmarks (MATH, AIME), where prior models scored in the 20-50% range. The gap is enormous; for math-heavy applications, reasoning models are transformative.
The coding gains. SWE-bench, code-contest, and similar benchmarks improve substantially. The "fix this real GitHub bug" task moves from 30% to 50%+ with reasoning models. Code that compiles and passes tests is the reward signal; the model learns to produce it.
The science-reasoning gains. These show up in multi-step reasoning across scientific papers, chains of inference about mechanisms, and analysis of experimental design. The gains are consistent though smaller than math/code (~10-30% improvement typical).
The "doesn't help" cases. For conversational chat, casual Q&A, and simple lookups, the bottleneck isn't reasoning; reasoning models cost more without quality benefit. Don't pay a reasoning premium for tasks that don't need reasoning.
The mixed-task UX. Production systems often route: easy queries go to fast non-reasoning models; hard queries go to reasoning models. The router itself can be a smaller model that classifies query difficulty. Per-query routing captures most of the reasoning model's benefit at close to the fast model's cost.
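A router of this kind might look like the sketch below. The keyword-and-length heuristic is a deliberately crude placeholder for the small classifier model described above, and the hint list, weights, threshold, and model names are all assumptions made for illustration.

```python
# Hypothetical cues that a query needs multi-step reasoning.
HARD_HINTS = ("prove", "debug", "derive", "optimize", "step by step")

def difficulty_score(query: str) -> float:
    """Crude difficulty estimate in [0, 1]; a real router would use a model."""
    q = query.lower()
    hint_hits = sum(hint in q for hint in HARD_HINTS)
    length_term = min(len(q.split()) / 200, 1.0)  # long prompts skew harder
    return min(1.0, 0.4 * hint_hits + 0.6 * length_term)

def route(query: str, threshold: float = 0.3) -> str:
    """Return the (hypothetical) model tier to send this query to."""
    if difficulty_score(query) >= threshold:
        return "reasoning-model"
    return "fast-model"

easy = route("What is the capital of France?")
hard = route("Prove that the sum of two even numbers is even")
```

Whatever classifier you use, log the routing decision next to the final quality score so the threshold can be tuned against real traffic.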
Common antipatterns
Using reasoning models for every query. Cost without proportional benefit on easy queries. Route by difficulty.
Hiding reasoning in production debugging. When debugging, exposing the reasoning helps you find what went wrong. Production hides it from users; debugging needs it.
Comparing reasoning to non-reasoning on biased benchmarks. Some benchmarks favor reasoning more than others. Build evals across diverse benchmarks for real comparison.
Confusing "longer answers" with "more reasoning". Some models produce verbose answers without actually reasoning more. Verify by capability, not by token count.
What to do this week
Three moves. (1) For your hardest production task, A/B test a reasoning model vs your current model. The quality lift tells you whether to migrate. (2) Build a difficulty router that sends easy queries to fast models and hard ones to reasoning models. The cost savings are usually substantial. (3) Set per-task compute budgets when using reasoning models. Without budgets, edge-case reasoning loops can produce surprise bills.
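For move (1), a paired comparison on the same task set is usually enough to see whether the lift is real. The sketch below assumes you have a pass/fail grade per task for each model; it reports the accuracy lift and an exact two-sided sign test on the tasks where the two models disagree. The function name and output keys are illustrative.

```python
from math import comb

def ab_summary(baseline: list[bool], candidate: list[bool]) -> dict:
    """Compare two models graded on the same tasks, in the same order."""
    if len(baseline) != len(candidate):
        raise ValueError("both models must be graded on the same tasks")
    n = len(baseline)
    wins = sum(c and not b for b, c in zip(baseline, candidate))
    losses = sum(b and not c for b, c in zip(baseline, candidate))
    lift = (sum(candidate) - sum(baseline)) / n
    # Exact two-sided sign test over discordant pairs: under the null
    # hypothesis, each disagreement is a fair coin flip.
    d = wins + losses
    k = max(wins, losses)
    if d == 0:
        p_value = 1.0
    else:
        tail = sum(comb(d, i) for i in range(k, d + 1)) / 2 ** d
        p_value = min(1.0, 2 * tail)
    return {"lift": lift, "wins": wins, "losses": losses, "p_value": p_value}

# Example: 20 tasks; the candidate fixes 10 failures and breaks nothing.
baseline = [True] * 5 + [False] * 15
candidate = [True] * 15 + [False] * 5
summary = ab_summary(baseline, candidate)
```

A small p-value with a meaningful lift supports migrating; a lift driven by a handful of tasks with a p-value near 1 suggests collecting more tasks before deciding.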