AI & ML · Advanced · By Samson Tanimawo, PhD · Published Aug 9, 2026 · 7 min read

Agentic Reasoning: Tree of Thoughts, ReAct, and Reflexion

Three patterns for getting models to reason better at inference time. Each adds compute but unlocks problems the model couldn’t solve in a single pass.

Chain-of-thought, briefly

Chain-of-thought (CoT) prompting was the original "reasoning prompt" technique: ask the model to think step-by-step before answering. CoT improves performance on multi-step problems by giving the model space to work through intermediate steps before committing to an answer. It is a single linear chain: one path forward, no exploration.

Why CoT works. Models trained on text learned reasoning by example: humans wrote step-by-step explanations, and the model absorbed the pattern. Prompting with "think step by step" at inference time elicits that learned reasoning behaviour.
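A minimal sketch of eliciting that pattern: the helper below wraps a question in a step-by-step instruction and pulls the final answer out of a completion. The function names, the "Answer:" convention, and the hard-coded completion are illustrative assumptions, not part of any library.

```python
import re
from typing import Optional

COT_SUFFIX = (
    "Let's think step by step, then give the final answer "
    "on a line starting with 'Answer:'."
)


def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model writes intermediate steps first."""
    return f"Question: {question}\n{COT_SUFFIX}"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if missing."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Hypothetical completion, hard-coded here in place of a real model call.
sample_completion = (
    "There are 3 boxes with 4 pens each, so 3 * 4 = 12 pens.\n"
    "Two pens are removed, so 12 - 2 = 10.\n"
    "Answer: 10"
)

print(build_cot_prompt("3 boxes hold 4 pens each; 2 pens are removed. How many remain?"))
print(extract_answer(sample_completion))  # 10
```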

Why CoT plateaus. A linear chain is fragile. One bad step propagates, and because the model cannot backtrack, a wrong reasoning path usually goes uncorrected. For problems where multiple reasoning paths exist, CoT picks one and commits. ToT, ReAct, and Reflexion exist because CoT alone isn't enough for harder problems.

The honest comparison. CoT is great when the problem has one obvious reasoning chain. ToT/ReAct/Reflexion are for problems where exploration matters, where you might need to back up, try a different approach, or use external tools. The choice depends on the problem; CoT is still the right answer for many cases.

Tree of Thoughts

ToT generalises CoT to a tree. At each step, the model generates multiple candidate thoughts, and a separate evaluator scores them. The search proceeds along the best paths; bad branches get pruned. The gains on hard search problems can be large: in the original ToT paper, GPT-4 solved 4% of Game of 24 puzzles with CoT and 74% with ToT, while the improvement on creative writing was more modest.

The structure. Define a "thought": a small reasoning step that produces a partial state. The model generates k candidate next-thoughts; the evaluator scores each; the best n are kept; the process repeats until a terminal state is reached. The overall procedure is, in effect, a beam search through reasoning space.
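The loop described above can be sketched on a toy problem. This is a sketch under stated assumptions, not the paper's implementation: `propose` is a hard-coded stand-in for the LLM's thought generator, the scoring function is a simple distance heuristic, and the task (reach a target number from a start value) is invented for illustration.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, Tuple[str, ...]]  # (current value, thoughts taken so far)


def propose(state: State) -> List[State]:
    """Stand-in for the LLM: return k = 3 candidate next-thoughts."""
    value, path = state
    return [
        (value + 3, path + ("+3",)),
        (value * 2, path + ("*2",)),
        (value - 1, path + ("-1",)),
    ]


def tree_of_thoughts(
    start: int,
    target: int,
    score: Callable[[State], float],
    n_keep: int = 2,
    max_depth: int = 6,
) -> Optional[State]:
    """Beam search: expand every kept state, score, keep the best n_keep."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if candidate[0] == target:  # terminal state reached
                return candidate
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:n_keep]  # prune everything else
    return None


target = 11
result = tree_of_thoughts(1, target, score=lambda s: -abs(s[0] - target))
print(result)  # (11, ('+3', '*2', '+3'))
```

In a real system both `propose` and `score` would be model calls (or a verifier), which is where the extra compute comes from.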

The compute cost. ToT uses 5-50x the inference compute of CoT (because of branching). For hard problems where CoT scores 30% and ToT scores 70%, the cost is justified. For easy problems where both score 95%, ToT is wasted compute.
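To see where a multiplier in that range can come from, here is a rough call-count sketch. It assumes one generation call per candidate thought and one evaluation call per candidate, and it compares against a CoT chain that spends one step per tree level; real implementations may batch calls, and call counts are only a proxy for tokens.

```python
def tot_llm_calls(k: int, n_keep: int, depth: int) -> int:
    """Count LLM calls for a beam-style ToT search.

    Assumption: one generation call per candidate thought and one
    evaluation call per candidate.
    """
    calls = 0
    frontier = 1  # the search starts from a single root state
    for _ in range(depth):
        candidates = frontier * k
        calls += candidates  # generate k thoughts from each kept state
        calls += candidates  # score every candidate
        frontier = min(n_keep, candidates)
    return calls


def cost_multiplier(k: int, n_keep: int, depth: int) -> float:
    """Calls relative to a CoT chain that spends one step per level."""
    return tot_llm_calls(k, n_keep, depth) / depth


print(cost_multiplier(k=3, n_keep=2, depth=3))  # 30 calls / 3 levels = 10.0
print(cost_multiplier(k=5, n_keep=5, depth=4))  # 160 calls / 4 levels = 40.0
```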

The evaluator design. The evaluator's quality dominates ToT's success. An evaluator can be another LLM call (general but expensive), a heuristic (cheap but problem-specific), or a verifier (exact, but only available for verifiable domains like math). Pick the evaluator that matches your problem's verifiability.
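As an illustration of the two cheaper options, the sketch below pairs an exact verifier with a crude heuristic for Game of 24. Both functions and their names are assumptions made for this example: the verifier checks a finished expression, while the heuristic only guesses how promising a partial state looks.

```python
import ast
import operator
from collections import Counter
from typing import List

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Evaluate a tiny arithmetic AST; reject anything else."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def verifier_score(expression: str, numbers: List[int], target: int = 24) -> float:
    """Exact check: uses each given number once and evaluates to the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = Counter(
            n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)
        )
        if used != Counter(numbers):
            return 0.0
        return 1.0 if abs(_evaluate(tree) - target) < 1e-9 else 0.0
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0


def heuristic_score(remaining: List[float], target: int = 24) -> float:
    """Cheap guess for a partial state: is any remaining number near the target?"""
    if not remaining:
        return 0.0
    closest = min(abs(x - target) for x in remaining)
    return 1.0 / (1.0 + closest)


print(verifier_score("(10 - 4) * (13 - 9)", [4, 9, 10, 13]))  # 1.0
print(verifier_score("4 + 9 + 10 + 13", [4, 9, 10, 13]))      # 0.0
```

An LLM-based evaluator would expose the same "state in, score out" shape, at the price of one model call per candidate.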

When to use ToT. ToT fits problems with deep search trees, partial states that can be evaluated mid-reasoning, and verifiable intermediate steps. Math word problems, puzzles, and planning problems are the sweet spot. Open-ended generation usually isn't.

ReAct

ReAct interleaves thought and action. The model generates a "thought" describing what to do next, then an "action" that calls a tool, then observes the result. The pattern is roughly "Reason → Act → Observe → Reason again". ReAct made tool-use practical; before it, models would emit tool calls without grounding them in context.

The structure. The prompt format alternates: "Thought: I need to find X. Action: search(X). Observation: results... Thought: based on this, I should...". The model emits tool calls in a structured way; the runtime parses and executes; the loop continues until the model emits a final answer.
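A minimal sketch of that runtime loop, assuming a "Thought / Action: tool(input) / Observation" text format and a "Final Answer:" marker. The regular expressions, the `multiply` tool, and the scripted model are all hypothetical stand-ins so the example runs without a real LLM.

```python
import re
from typing import Callable, Dict, Optional

ACTION_RE = re.compile(r"Action:\s*(\w+)\((.*?)\)")
FINAL_RE = re.compile(r"Final Answer:\s*(.+)")


def react_loop(
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Optional[str]:
    """Alternate model steps and tool calls until a final answer or the cap."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            transcript += "Observation: no valid action found.\n"
            continue
        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool '{name}'"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def multiply(arg: str) -> str:
    """Toy tool: multiply two comma-separated integers."""
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: act first, then answer from the observation."""
    observations = re.findall(r"Observation:\s*(.+)", transcript)
    if not observations:
        return "Thought: I should use the calculator.\nAction: multiply(12, 7)"
    return f"Thought: The tool returned {observations[-1]}.\nFinal Answer: {observations[-1]}"


print(react_loop(scripted_model, {"multiply": multiply}, "What is 12 * 7?"))  # 84
```

The key design point is that the observation is written by the runtime, never by the model, so the next thought is conditioned on a real tool result.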

Why it works. The interleaving pulls reasoning out of the model's head and onto the prompt. Each thought is conditioned on prior observations, so the model is far less likely to lose track: the trail is explicit. Without ReAct (or similar), models tend to hallucinate tool outputs or skip tool use entirely.

The robustness payoff. Without ReAct framing, agents commonly fail to use tools effectively even when tools are provided. With ReAct, tool use becomes reliable for most modern models. The framing is now standard in agent frameworks (LangChain, LlamaIndex, AutoGen).

The limitations. ReAct is reactive: each step decides only the next action. For tasks requiring multi-step planning, ReAct can wander. Plan-then-execute hybrids (plan first, execute with ReAct) outperform pure ReAct on planning-heavy tasks.

Reflexion

Reflexion adds self-critique. The model attempts a task, evaluates its own attempt, generates a reflection ("I made this mistake; next time I should..."), then re-attempts with the reflection in context. The result is iterative self-improvement, and it works especially well on tasks with feedback signals (code that runs, tests that pass).

The loop. Attempt → evaluate → reflect → re-attempt. Each iteration uses the prior attempts and reflections as context. The model "learns" within the conversation, not from gradient updates but from contextual conditioning.
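The attempt, evaluate, reflect, re-attempt cycle can be sketched as follows. The three callables are assumptions for illustration: in practice `attempt` and `reflect` would be model calls, while here they are scripted so the example runs on its own, with a tiny unit test acting as the feedback signal.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    attempt: Callable[[List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    max_iters: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry with accumulated reflections in context; return (result, iterations)."""
    reflections: List[str] = []
    for iteration in range(1, max_iters + 1):
        candidate = attempt(reflections)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate, iteration
        reflections.append(reflect(candidate, feedback))
    return None, max_iters


def scripted_attempt(reflections: List[str]) -> str:
    """Stand-in for the model: buggy first try, fixed once a reflection exists."""
    if not reflections:
        return "def is_even(n):\n    return n % 2 == 1\n"
    return "def is_even(n):\n    return n % 2 == 0\n"


def run_tests(source: str) -> Tuple[bool, str]:
    """Feedback signal: execute the candidate and check two cases."""
    namespace: dict = {}
    exec(source, namespace)
    fn = namespace["is_even"]
    failures = [n for n, want in [(2, True), (3, False)] if fn(n) != want]
    if failures:
        return False, f"wrong result for inputs {failures}"
    return True, "all tests passed"


def scripted_reflect(candidate: str, feedback: str) -> str:
    return f"Previous attempt failed ({feedback}); re-check the parity comparison."


final_source, iterations = reflexion_loop(scripted_attempt, run_tests, scripted_reflect)
print(iterations)  # 2
```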

The feedback-signal dependency. Reflexion needs SOMETHING to reflect on. For code, "test pass/fail" is a strong signal. For math, "answer correct/incorrect" is strong. For open-ended writing, the signal is weaker (LLM-judge); Reflexion helps less here.

The compute cost. Each iteration is a full task attempt plus a reflection. Reflexion typically runs 3-5 iterations, which works out to roughly 4-6x the compute of a single attempt once the reflection calls are counted. For high-stakes tasks where quality matters, the cost is justified. For volume tasks, single-shot or one-iteration approaches are usually better economics.

The convergence behaviour. Reflexion can converge (each iteration improves) or diverge (iterations get stuck on the same wrong reasoning). Convergence is more common when the feedback signal is strong; divergence is more common with weak signals. Bound iteration count; don't trust unbounded "keep reflecting".
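One way to bound the loop is a stop rule over per-iteration scores. The sketch below assumes each iteration yields a score in [0, 1] (for example, the fraction of tests passing); the function name, the `patience` parameter, and the thresholds are illustrative choices, not a standard API.

```python
from typing import List, Optional


def should_stop(
    scores: List[float], max_iters: int = 5, patience: int = 2
) -> Optional[str]:
    """Return a reason to stop reflecting, or None to keep going."""
    if not scores:
        return None
    if scores[-1] >= 1.0:
        return "solved"
    if len(scores) >= max_iters:
        return "iteration cap reached"
    # Stalled: the last `patience` iterations did not beat the earlier best.
    if len(scores) > patience and max(scores[-patience:]) <= max(scores[:-patience]):
        return "no improvement"
    return None


print(should_stop([0.2, 0.4, 1.0]))            # solved
print(should_stop([0.5, 0.5, 0.5]))            # no improvement
print(should_stop([0.1, 0.2, 0.3, 0.4, 0.5]))  # iteration cap reached
print(should_stop([0.1, 0.2, 0.3]))            # None
```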

Combining them

Pick by problem shape. Search-y problem → ToT. Tool-using problem → ReAct. Iterative-improvement problem with verifiable feedback → Reflexion. The combinations are powerful: ReAct + Reflexion (an agent that uses tools and reflects on its mistakes) is more capable than either alone.

The decision tree. Does the task need external information or actions? If yes, you need ReAct. Does the task have verifiable correctness (code, math)? If yes, Reflexion adds value. Is the task search-flavored with multiple paths? If yes, ToT helps. The three are orthogonal; pick the ones that match your task.
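The three questions translate directly into a small selector. This is a sketch of the decision tree as stated, with one added assumption: when none of the questions applies, it falls back to plain CoT.

```python
from typing import List


def pick_techniques(
    needs_tools: bool, verifiable: bool, multi_path_search: bool
) -> List[str]:
    """Map task properties to reasoning patterns; the checks are independent."""
    chosen: List[str] = []
    if needs_tools:
        chosen.append("ReAct")
    if verifiable:
        chosen.append("Reflexion")
    if multi_path_search:
        chosen.append("ToT")
    return chosen or ["CoT"]  # assumption: default to the cheap baseline


print(pick_techniques(needs_tools=True, verifiable=True, multi_path_search=False))
# ['ReAct', 'Reflexion']
print(pick_techniques(needs_tools=False, verifiable=False, multi_path_search=False))
# ['CoT']
```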

The compute trade-off. A naive combination of all three (ToT + ReAct + Reflexion) costs 50-200x the compute of single-shot. For real applications, choose techniques by the quality they buy per unit of compute. The right combination is usually 1-2 techniques, not all three.

The framework reality. LangChain, LlamaIndex, and AutoGen all ship implementations or building blocks for these patterns, so you rarely need to implement them from scratch. What you DO need is the judgment of which pattern fits your problem; framework choice is downstream of that.

Common antipatterns

Using ToT on every problem. ToT costs 5-50x the compute of CoT, which is only justified for hard search problems. Don't pay the ToT tax on easy problems.

ReAct without tool grounding. ReAct that "thinks" but doesn't actually call tools collapses to expensive CoT. Verify the action loop is actually exercising tools.
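A quick way to verify this is to scan saved transcripts for actual tool calls. The sketch assumes transcripts use an "Action: tool(input)" line format; the regular expression and function names are illustrative and would need adapting to your agent's real log format.

```python
import re
from typing import List

ACTION_LINE = re.compile(r"^Action:\s*\w+\(.*\)\s*$", re.MULTILINE)


def count_tool_calls(transcript: str) -> int:
    """Number of well-formed Action lines in one agent transcript."""
    return len(ACTION_LINE.findall(transcript))


def ungrounded_fraction(transcripts: List[str]) -> float:
    """Share of runs that reasoned but never called a tool."""
    if not transcripts:
        return 0.0
    ungrounded = sum(1 for t in transcripts if count_tool_calls(t) == 0)
    return ungrounded / len(transcripts)


runs = [
    "Thought: look it up.\nAction: search(release date)\nObservation: ...\nFinal Answer: ...",
    "Thought: I think I already know this.\nFinal Answer: ...",
]
print(ungrounded_fraction(runs))  # 0.5
```

A high fraction on tasks that should require tools suggests the agent is paying for ReAct while behaving like CoT.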

Unbounded Reflexion loops. Without an iteration cap, divergent loops burn compute and produce no improvement. Cap at 3-5 iterations; exit early on convergence.

Picking technique by hype. Match technique to problem shape; don't just use the latest paper's method. CoT is still right for many problems.

What to do this week

Three moves. (1) Categorise your top 3 LLM use cases by problem shape (search-y, tool-using, iterative). The categorisation tells you which technique should improve which case. (2) For one case, run a CoT baseline and a chosen technique side-by-side on a 50-example eval. The quality-cost trade-off becomes concrete. (3) Set per-task compute budgets (for example, "this task gets at most 3 ReAct iterations") so a runaway loop doesn't burn the budget on edge cases.
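For move (2), a side-by-side harness can be very small. The sketch below assumes each solver returns an answer plus the number of LLM calls it used; the solvers and the two-example dataset are toy stand-ins, so swap in your real pipeline and your 50 examples.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Solver = Callable[[str], Tuple[str, int]]  # question -> (answer, llm_calls_used)


@dataclass
class EvalResult:
    name: str
    correct: int
    total: int
    llm_calls: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def calls_per_correct(self) -> float:
        return self.llm_calls / self.correct if self.correct else float("inf")


def run_eval(name: str, solver: Solver, examples: List[Tuple[str, str]]) -> EvalResult:
    """Score one solver on (question, expected_answer) pairs."""
    correct = calls = 0
    for question, expected in examples:
        answer, used = solver(question)
        calls += used
        correct += int(answer == expected)
    return EvalResult(name, correct, len(examples), calls)


# Toy data and scripted solvers, standing in for a real eval set and pipelines.
examples = [("1 + 1", "2"), ("2 + 2", "4")]
lookup = dict(examples)
baseline = run_eval("cot", lambda q: ("2", 1), examples)
technique = run_eval("tot", lambda q: (lookup[q], 4), examples)

for r in (baseline, technique):
    print(r.name, r.accuracy, r.calls_per_correct)
```

Comparing `accuracy` alongside `calls_per_correct` makes it explicit whether the extra quality is worth the extra compute for that use case.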