Process Reward Models: Supervising Reasoning
Outcome-only rewards train models to get the right final answer. Process rewards train them to reason correctly along the way. The difference matters enormously for hard problems.
Outcome rewards
An outcome reward model asks one question: was the final answer right? It’s simple and cheap to verify (especially in math and code, where you can run tests). But it’s also a weak signal: many wrong reasoning paths occasionally produce right answers, and the model learns nothing about why.
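A minimal sketch of what that looks like in code, with illustrative names: the whole trajectory collapses to a single scalar based on the final answer.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches the
    reference, else 0.0. Nothing about the reasoning is inspected."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

# Every step in a wrong-answer trajectory receives the same zero signal,
# even if most of the steps were correct.
print(outcome_reward("42", "42"))  # 1.0
print(outcome_reward("41", "42"))  # 0.0
```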
Process rewards
A process reward model (PRM) scores each step in a reasoning chain. Did this step follow correctly? Was this intermediate computation right? The model learns not just to get the right answer but to reason in the right way.
Example: solving a multi-step math problem. An outcome reward says “final answer wrong, total reward 0.” A process reward says “step 1 correct, step 2 correct, step 3 had an arithmetic error, and everything after it is suspect.” The model gets a much richer signal.
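To make the contrast concrete, here is a sketch of the per-step signal shape. The step labels are hand-supplied for illustration; a real PRM would be a learned model. The discounting of steps after the first error is one common convention, not the only one.

```python
def process_rewards(step_labels: list[bool]) -> list[float]:
    """One reward per reasoning step: 1.0 for a correct step, 0.0
    otherwise. Steps after the first error are zeroed, since their
    correctness is conditioned on a bad intermediate state."""
    rewards = []
    seen_error = False
    for ok in step_labels:
        if not ok:
            seen_error = True
        rewards.append(0.0 if seen_error else 1.0)
    return rewards

# The worked example from the text: steps 1-2 correct, step 3 wrong.
print(process_rewards([True, True, False, True]))  # [1.0, 1.0, 0.0, 0.0]
```

Compared with the single scalar of an outcome reward, this assigns credit to exactly the steps that earned it.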
The labelling cost
The hard part is labelling. For each multi-step solution, every step needs to be evaluated, which runs roughly 5-20x the cost of outcome-only labelling.
Tricks to reduce cost:
- Auto-grading on math/code: synthesise problems where intermediate state can be verified mechanically.
- LLM-as-grader: use a strong model to label process correctness. Reasonable on average, but it struggles with novel or creative steps.
- Hybrid: humans label a small high-quality set; LLMs propagate to a much larger one.
Connection to reasoning models
The 2024-2025 wave of reasoning models (o1, o3, DeepSeek-R1, Claude reasoning mode) leans heavily on process rewards. The model is trained to extend its chain of thought, with the PRM rewarding correct reasoning steps even when the final answer is partial.
This is what unlocked the leap in math and coding benchmarks: the models aren’t smarter in some raw sense; they’ve been taught to think in longer, more reliable chains.
When to care
If you’re training your own model on multi-step reasoning tasks (math, complex code review, multi-hop QA), PRMs are the technique to study.
If you’re consuming LLMs via API: PRMs are why the “reasoning” models from Anthropic, OpenAI, and Google work better on hard problems. You don’t implement a PRM yourself; you benefit from one trained into the model you’re calling.
The takeaway for practitioners: when a task has clear stepwise structure (math, code, planning), use a reasoning-mode model. The PRM-trained version can outperform a standard chat model by 20-40 percentage points on the hardest examples.