AI & ML Advanced · By Samson Tanimawo, PhD · Published Jun 16, 2026 · 5 min read

Reasoning Models: o1-Style Architecture

OpenAI’s o1 introduced “thinking models” that produce a long internal reasoning chain before answering. The pattern has since spread to other labs. Here is what makes these models different.

The core idea

A standard LLM emits one token at a time toward the answer. A reasoning model emits internal “thinking” tokens (often hidden from the user) before producing the final answer. Those thinking tokens are the model’s working scratchpad: problem decomposition, intermediate computations, self-correction.
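To make the split concrete, here is a minimal sketch of separating hidden thinking tokens from the final answer. It assumes the model wraps its scratchpad in `<think>...</think>` tags, a delimiter convention some open reasoning models use; o1 itself hides the chain server-side, so this is illustrative, not its API.

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Separate hidden 'thinking' tokens from the final answer.

    Assumes a <think>...</think> delimiter convention (hypothetical here);
    returns (thinking, answer). If no tags are found, everything is answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        answer = raw[match.end():].strip()
        return thinking, answer
    return "", raw.strip()

raw = "<think>2 dozen = 24; 24 - 5 = 19.</think>The answer is 19."
thinking, answer = split_reasoning(raw)
```

The user only ever sees `answer`; the scratchpad stays internal.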

Training reasoning models

Reasoning models are trained with reinforcement learning on chains of thought, often using process reward models that score individual reasoning steps, not just final outputs. The training data emphasises problems with verifiable answers (math, code, formal logic), where the reward signal is unambiguous.
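A toy sketch of how process and outcome rewards might be combined for one chain of thought. The `step_scorer` stands in for a trained process reward model (PRM) returning a validity probability per step; the 50/50 weighting and min-aggregation are illustrative assumptions, not any lab's actual recipe (real pipelines also use products or learned aggregation).

```python
def chain_reward(steps, step_scorer, final_correct):
    """Combine per-step process rewards with an outcome reward.

    steps: list of reasoning-step strings.
    step_scorer: stand-in PRM mapping a step to a score in [0, 1].
    final_correct: whether the chain's final answer was verified correct.

    Taking the minimum step score penalises a chain for its weakest step.
    """
    process = min(step_scorer(s) for s in steps) if steps else 0.0
    outcome = 1.0 if final_correct else 0.0
    return 0.5 * process + 0.5 * outcome  # illustrative weighting

# Dummy scorer: trusts steps that contain an explicit equation.
scorer = lambda s: 0.9 if "=" in s else 0.5
reward = chain_reward(["2 dozen = 24", "24 - 5 = 19"], scorer, final_correct=True)
```

The verifiable-answer requirement is what makes `final_correct` cheap to compute; for open-ended tasks that bit alone is hard.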

Test-time compute

The unusual feature: longer thinking yields predictably better answers. You can trade latency for accuracy at inference time: a “low-effort” reasoning call might use roughly 10x the tokens of a standard call, a “high-effort” call 100x. Cost scales linearly with tokens; accuracy on hard problems often jumps non-linearly.
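The linear-cost side of that trade-off is easy to sketch. All the numbers below are assumptions (a 1,000-token baseline, a hypothetical price, and the 10x/100x multipliers from the text); only the structure matters.

```python
BASE_TOKENS = 1_000            # assumed token count for a standard call
PRICE_PER_1K = 0.01            # hypothetical $ per 1K tokens

# Effort multipliers taken from the 10x / 100x figures above.
EFFORT_MULTIPLIER = {"low": 10, "high": 100}

def reasoning_budget(effort: str) -> tuple[int, float]:
    """Return (token budget, dollar cost) for a reasoning call.

    Cost scales linearly with tokens; the non-linear part of the
    trade-off (accuracy on hard problems) is not modelled here.
    """
    tokens = BASE_TOKENS * EFFORT_MULTIPLIER[effort]
    cost = tokens / 1000 * PRICE_PER_1K
    return tokens, cost

tokens_low, cost_low = reasoning_budget("low")
tokens_high, cost_high = reasoning_budget("high")
```

The dial is the point: same model, same prompt, 10x the spend, and on hard problems the accuracy gain can outrun the cost.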

Gains and where they apply

Reasoning models extend the frontier on tasks with verifiable, multi-step answers: competition math, code generation and debugging, formal logic.

They underperform on tasks where speed matters more than depth: customer-support chat, simple extraction, anything latency-sensitive. Use a regular model for those; reach for a reasoning model when you need the answer right and can wait for it.
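That routing rule fits in a few lines. The model names below are placeholders, not real endpoints; the logic just mirrors the guidance above.

```python
def pick_model(latency_sensitive: bool, needs_depth: bool) -> str:
    """Route between a fast standard model and a slow reasoning model.

    Hypothetical model names; the rule: pay the reasoning-model latency
    and token cost only when correctness justifies the wait.
    """
    if needs_depth and not latency_sensitive:
        return "reasoning-model"   # hard math, code, multi-step planning
    return "fast-chat-model"       # support chat, simple extraction

choice_chat = pick_model(latency_sensitive=True, needs_depth=False)
choice_proof = pick_model(latency_sensitive=False, needs_depth=True)
```

When a task is both latency-sensitive and depth-hungry, this sketch defaults to the fast model; in practice that conflict is a product decision, not a routing one.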