Multi-Step Tool Use: The Planning Problem
A single tool call is easy. Five tool calls in sequence, where each depends on the last, is the hardest open problem in agent design.
Where it breaks
Multi-step tool use means the model must pick tools, sequence them, pass the outputs of one as inputs to another, and synthesise a final answer. As of 2026, models reliably handle 2-3 step tasks. Beyond ~5 steps, error rates compound and most models fail. The bottleneck is planning: keeping track of "what have I done, what's left, what's the path forward".
The compounding error problem. If each tool call has a 90% success rate and failures are independent, 5 sequential calls have 0.9^5, or about 59%, end-to-end success. For 10 calls it is 0.9^10, or about 35%. The compounding makes long chains unreliable even when each step is competent.
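The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability; the function names are illustrative, not from any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds (independent steps)."""
    return per_step ** steps


def max_steps_for_target(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end success stays at or above `target`."""
    if not (0.0 < per_step < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("expects 0 < per_step < 1 and 0 < target <= 1")
    steps = 0
    while end_to_end_success(per_step, steps + 1) >= target:
        steps += 1
    return steps


for n in (1, 3, 5, 10):
    print(n, round(end_to_end_success(0.9, n), 2))
```

Under the same assumption, `max_steps_for_target(0.9, 0.5)` gives the longest chain that still succeeds at least half the time, which is a quick way to size a task against a measured per-step rate.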
The state-tracking problem. Beyond a few steps, the model loses track of: which tools it's already called, which results matter, what the current sub-goal is. Symptoms: re-calling the same tool with the same arguments, ignoring prior results, going in circles.
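One of these symptoms, re-calling the same tool with the same arguments, is cheap to detect outside the model. The sketch below is one possible approach, assuming tool arguments are JSON-serialisable; `CallLedger` is a hypothetical helper, not part of any framework.

```python
import json


class CallLedger:
    """Records tool calls so repeats with identical arguments can be flagged."""

    def __init__(self) -> None:
        self._seen = {}

    def record(self, tool: str, args: dict) -> int:
        """Store the call and return how many times it has now been made."""
        # sort_keys makes the key independent of argument order.
        key = json.dumps([tool, args], sort_keys=True)
        self._seen[key] = self._seen.get(key, 0) + 1
        return self._seen[key]


ledger = CallLedger()
first = ledger.record("search", {"q": "llm agents", "limit": 5})
repeat = ledger.record("search", {"limit": 5, "q": "llm agents"})
other = ledger.record("search", {"q": "tool use", "limit": 5})
```

A count above 1 is a signal to warn the model, stop the loop, or trigger a replan. Which of those is right depends on the task.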
The 2026 ceiling. Frontier models (Claude, GPT, Gemini at top tier) handle 5-8 steps reliably for well-structured tasks. Beyond that, success rates drop substantially. This ceiling has moved up year over year; expect it to keep moving, but plan around the current state.
Plan-then-execute
The model writes an explicit plan (numbered steps), then executes step by step. The plan is the model's "external memory", referenced and updated as execution proceeds. Plan-then-execute outperforms reactive ReAct on tasks with 5+ steps because the plan keeps the model focused.
The structure. Phase 1 (plan): the model reads the task and writes a plan: "1. Search for X. 2. Use the result to query Y. 3. Synthesise Z." Phase 2 (execute): the model executes step 1, then step 2, etc. After each step, optionally update the plan based on results.
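The two phases can be sketched as follows. Everything here is a stand-in: `make_plan` returns a fixed list where a real system would call the model, and the entries in `TOOLS` are hypothetical functions rather than real tool integrations.

```python
from typing import Callable

# Hypothetical tools: plain functions standing in for real tool calls.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "query": lambda q: f"rows matching {q}",
    "synthesise": lambda text: f"summary of {text}",
}


def make_plan(task: str) -> list[str]:
    """Stand-in for the planning call; a real system would ask the model."""
    return ["search", "query", "synthesise"]


def execute_plan(task: str) -> tuple[str, list[str]]:
    plan = make_plan(task)  # Phase 1: write the plan before any tool call
    log = []
    current = task
    for number, tool_name in enumerate(plan, start=1):  # Phase 2: execute in order
        current = TOOLS[tool_name](current)  # each output feeds the next step
        log.append(f"{number}. {tool_name}")
    return current, log


answer, log = execute_plan("X")
```

The plan is an ordinary list that exists before execution starts, so it can be logged, inspected, or rejected before any tool runs.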
Why it works. The plan is in the prompt context throughout execution. The model doesn't have to re-derive what to do next; it consults the plan. The plan also surfaces problems early: a bad plan is visible before any tool calls happen, which allows replanning.
The replanning question. Should the plan be fixed or revisable? Fixed plans are simpler but break when reality differs from plan-time expectations. Revisable plans are more robust but introduce complexity (when to revise, how to track plan-vs-execution divergence). Most production systems start fixed and add revision when fixed-plan failures dominate.
The compute cost. Plan-then-execute uses a planning call plus per-step calls. Total cost is roughly 1.2-1.5x the cost of pure ReAct. The quality improvement on long chains is usually >10%, so the cost is justified for tasks beyond ~5 steps.
Reactive ReAct
The model decides each next step based on what it just observed. Lower planning overhead; works for short chains; degrades fast on long ones because it lacks a global plan to reference.
When ReAct wins. Tasks where the next step truly depends on the prior observation. Search-then-summarise: you don't know what to summarise until you've searched. Triage-then-fix: you don't know what to fix until you've diagnosed. For 2-3 step reactive tasks, pure ReAct is the right choice.
When ReAct loses. Tasks with a clear multi-step structure that doesn't depend much on observations. "Generate a report from these 5 data sources" benefits from a plan; ReAct just wanders through the sources without an organising principle.
ReAct as the starting default. Start tasks with ReAct unless you already know they need planning. If reactive runs fail in predictable ways (look at the failure modes), upgrade to plan-then-execute. The cost of upgrading is small once you have telemetry on failure modes.
Hybrid
The pattern that works in production: plan upfront, execute with reactive flexibility, replan when execution diverges from plan. Combines the planning benefit (long-chain coherence) with the reactive benefit (handles surprises).
The structure. Plan phase produces step list. Execute phase walks the steps; each step is itself a small ReAct loop (the agent might use multiple tool calls to complete one plan step). After each step completes, check whether the plan still makes sense given results; if not, replan the remainder.
The replan trigger. Common triggers: "I expected X but got Y, the next step assumes X". "A new sub-task emerged that wasn't in the plan". "The plan's later steps are no longer needed". The trigger detection is itself an LLM call; it's cheap relative to executing a wrong plan.
The infrastructure. Plan-execute-replan needs explicit state tracking. The system records: what was the original plan, what's been executed, what results came back, what's the current plan version. Without state tracking, replanning becomes confused.
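A minimal sketch of that state, plus a loop that replans the remainder when a trigger fires. `PlanState` and `run` are illustrative names. The three callables are assumptions standing in for a per-step ReAct loop, the trigger-detection call, and the replanning call. The `max_steps` cap is an added safeguard against endless replanning.

```python
from dataclasses import dataclass, field


@dataclass
class PlanState:
    """Everything needed to replan without losing track of what already ran."""
    original_plan: list[str]
    current_plan: list[str] = field(default_factory=list)
    executed: list[tuple[str, str]] = field(default_factory=list)  # (step, result)
    version: int = 1

    def __post_init__(self) -> None:
        if not self.current_plan:
            self.current_plan = list(self.original_plan)

    def remaining(self) -> list[str]:
        return self.current_plan[len(self.executed):]

    def record(self, step: str, result: str) -> None:
        self.executed.append((step, result))

    def replan(self, new_remainder: list[str]) -> None:
        # Keep executed steps as history; replace only what has not run yet.
        done = [step for step, _ in self.executed]
        self.current_plan = done + list(new_remainder)
        self.version += 1


def run(state, execute_step, needs_replan, propose_remainder, max_steps=20):
    while state.remaining() and len(state.executed) < max_steps:
        step = state.remaining()[0]
        state.record(step, execute_step(step))
        if needs_replan(state):  # in practice, a cheap LLM call
            state.replan(propose_remainder(state))
    return state


state = run(
    PlanState(original_plan=["fetch A", "fetch B", "report"]),
    execute_step=lambda step: "empty" if step == "fetch B" else "ok",
    needs_replan=lambda st: st.executed[-1][1] == "empty",
    propose_remainder=lambda st: ["fetch B from backup", "report"],
)
```

Keeping `original_plan` untouched and bumping `version` on every replan makes plan-vs-execution divergence visible in logs.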
The framework support. AutoGen, LangGraph, and CrewAI all support hybrid patterns. The framework provides the state machine; you provide the agent prompts and tool definitions.
The 2026 ceiling
Honest assessment: agents handle 5-8 step structured tasks well. Beyond that, error rates climb. The ceiling has moved up every year: the 2024 ceiling was 3-4 steps. By 2027, expect 10-15 step tasks to be reliable. By 2028, longer. For now, design tasks to fit within the current ceiling.
The model improvement vector. Each generation handles longer chains. Improvements come from: bigger models with more in-context working memory, better training on tool-using data, post-training that explicitly rewards multi-step coherence. All three are compounding.
The infrastructure improvement vector. Better state tracking, better replanning, better verification reduce the impact of model errors. Even if base models stayed fixed, infrastructure could push effective ceiling higher.
The product implication. Build for the current ceiling, not the projected one. Tasks designed for 15-step chains won't work today; they'll work in 2027. Tasks designed for 5-step chains work today and will keep working as ceilings rise. Conservative design ages better.
The fallback strategy. For tasks beyond the ceiling, decompose into independent sub-tasks. Run each as its own short chain. Synthesise results in a final pass. The decomposition is hand-coded; the model handles each sub-task. This pattern scales further than any single-agent chain.
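As a rough sketch of this fallback, assuming the sub-tasks really are independent: `run_subtask` and `synthesise` are hypothetical stand-ins for a short agent chain and a final model call, and the thread pool is one possible way to run them side by side.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(source: str) -> str:
    """Hypothetical stand-in for one short agent chain over a single source."""
    return f"findings from {source}"


def synthesise(findings: list[str]) -> str:
    """Hypothetical final pass; a real system would make one more model call."""
    return " | ".join(findings)


def run_decomposed(sources: list[str]) -> str:
    # The decomposition is hand-coded: one independent sub-task per source.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(run_subtask, sources))  # order matches sources
    return synthesise(findings)


report = run_decomposed(["sales", "support", "billing"])
```

Each sub-task stays a short chain, so a failure in one source can be retried on its own without rerunning the others.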
Common antipatterns
Pure ReAct on 10-step tasks. Predictable failure, model loses track. Use plan-then-execute or decompose.
Plan-then-execute without replan triggers. Plan-then-execute is rigid; reality differs from plan; agent ploughs through the wrong plan. Add replan triggers when prior step results matter.
No iteration cap. Stuck loops. Always cap iterations; surface the cap to operators when hit.
One mega-prompt with all tools. The model has to figure out which tool when. Better: smaller toolsets per agent, with hierarchical agents that delegate. Specialist agents outperform generalist ones for multi-step work.
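The iteration cap from the list above takes only a few lines. This sketch assumes the agent loop can be expressed as a function that returns an answer when finished and `None` otherwise; `run_with_cap` and `IterationCapReached` are illustrative names.

```python
from typing import Callable, Optional


class IterationCapReached(RuntimeError):
    """Raised so operators see the cap was hit instead of a silent stop."""


def run_with_cap(step: Callable[[int], Optional[str]], max_iterations: int = 10) -> str:
    for iteration in range(1, max_iterations + 1):
        answer = step(iteration)
        if answer is not None:
            return answer
    raise IterationCapReached(f"no answer after {max_iterations} iterations")


# A loop that finishes on its third iteration, and one that never finishes.
finished = run_with_cap(lambda i: "done" if i == 3 else None)

try:
    run_with_cap(lambda i: None, max_iterations=5)
    cap_message = ""
except IterationCapReached as exc:
    cap_message = str(exc)
```

Raising a dedicated exception, rather than returning a partial answer, is one way to make sure the cap reaches operators through normal error reporting.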
What to do this week
Three moves. (1) Look at your longest-running agent task. Count the typical number of steps. If above 5, add planning. If above 10, decompose. (2) Add iteration caps and replan triggers if you don't have them. The first time a stuck loop burns $100 of compute is the moment you wish you had them. (3) Track per-step success rates. The bottleneck step is where to spend optimisation effort; the global success rate is too coarse to act on.
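For move (3), a per-step tracker can be very small. The sketch below assumes each step has a stable name and a clear pass/fail outcome; `StepStats` is a hypothetical helper, and the recorded outcomes are made-up sample data.

```python
from collections import defaultdict


class StepStats:
    """Counts attempts and successes per named step."""

    def __init__(self) -> None:
        self._attempts = defaultdict(int)
        self._successes = defaultdict(int)

    def record(self, step: str, ok: bool) -> None:
        self._attempts[step] += 1
        if ok:
            self._successes[step] += 1

    def success_rates(self) -> dict[str, float]:
        return {s: self._successes[s] / n for s, n in self._attempts.items()}

    def bottleneck(self) -> str:
        """Step with the lowest success rate; raises ValueError if nothing was recorded."""
        rates = self.success_rates()
        return min(rates, key=rates.get)


stats = StepStats()
for step, ok in [("search", True), ("search", True),
                 ("parse", True), ("parse", False),
                 ("write", True)]:
    stats.record(step, ok)

rates = stats.success_rates()
worst = stats.bottleneck()
```

In this sample the global success rate looks healthy, while `bottleneck()` points straight at the one step that needs work.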