Multi-Step Tool Use: The Planning Problem
A single tool call is easy. Five tool calls in sequence, where each depends on the last, is among the hardest open problems in agent design.
Where it breaks
Single tool calls work because the prompt fully describes the situation. Five calls in sequence don’t, because each call’s output adds information that the model has to integrate into its plan.
The classic failure: the agent gets results from step 2 that should change its plan, but it sticks to the original plan from step 1 because that plan is the loudest thing in context.
Plan-then-execute
One approach: the model writes the full plan up front, then executes each step. Predictable; easy to audit. Doesn’t handle surprises.
Best for tasks with knowable structure: data extraction, deterministic workflows, scheduled jobs.
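A minimal sketch of the pattern, assuming hypothetical `llm` and `run_tool` stand-ins for a model call and a tool dispatcher:

```python
from typing import Callable

def plan_then_execute(task: str,
                      llm: Callable[[str], str],
                      run_tool: Callable[[str], str]) -> list[str]:
    # 1. Ask the model for the full plan up front, one step per line.
    plan = llm(f"List the tool calls needed for: {task}").splitlines()
    # 2. Execute every step in order; the plan never changes mid-run,
    #    which is what makes it predictable and auditable.
    results = []
    for step in plan:
        results.append(run_tool(step))
    return results
```

Because the plan is fixed before execution, it can be logged, diffed, and reviewed before any tool runs, but a surprising result at step 2 cannot change what step 3 does.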
Reactive ReAct
The opposite: no upfront plan. Each step looks at current state and decides the next action. Handles surprises gracefully. Tends to wander on long tasks.
Best for exploratory tasks where the steps aren’t known: debugging, research, customer support.
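The reactive loop can be sketched the same way, again with hypothetical `llm` and `run_tool` stand-ins. Each iteration sees all prior observations and picks one next action:

```python
def react_loop(task: str, llm, run_tool, max_steps: int = 10) -> list[str]:
    observations: list[str] = []
    for _ in range(max_steps):
        # The model decides the next action from the current state,
        # not from a precommitted plan.
        action = llm(f"Task: {task}\nSo far: {observations}\nNext action or DONE:")
        if action.strip() == "DONE":
            break
        observations.append(run_tool(action))
    return observations
```

The `max_steps` cap is essential: without it, the wandering-on-long-tasks failure mode becomes an infinite loop.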
Hybrid
The pattern that’s emerged in production: write a high-level plan, execute reactively within each step, replan if results contradict assumptions.
Frameworks like AutoGPT, BabyAGI, and the modern crop of agent SDKs implement variants. None is perfect; all work better than pure-plan or pure-react in their target domains.
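One way the hybrid pattern might be wired up, as a sketch rather than any framework's actual implementation (`llm`, `run_tool`, and the prompt wording are all hypothetical):

```python
def hybrid_agent(task: str, llm, run_tool, max_replans: int = 3) -> list[str]:
    # High-level plan up front, one step per line.
    plan = llm(f"High-level plan for: {task}").splitlines()
    results: list[str] = []
    replans = 0
    i = 0
    while i < len(plan):
        result = run_tool(plan[i])
        results.append(result)
        # After each step, check whether the result contradicts the
        # assumptions behind the remaining plan.
        verdict = llm(f"Does '{result}' invalidate the rest of {plan[i+1:]}? yes/no")
        if verdict.strip().lower() == "yes" and replans < max_replans:
            # Keep completed steps, rewrite the remainder.
            plan = plan[:i + 1] + llm(f"Revised plan given: {result}").splitlines()
            replans += 1
        i += 1
    return results
```

The `max_replans` bound guards against the replanning step itself becoming a loop: an agent that rewrites its plan on every step has degenerated back to pure ReAct.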
The 2026 ceiling
Production agents top out at roughly 10-20 sequential tool calls before reliability collapses. Beyond that, the cumulative probability of a wrong call dominates.
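The compounding is easy to see with an illustrative calculation (the per-call success rate here is an assumed value, not a measured one):

```python
# If each tool call independently succeeds with probability p,
# the whole chain succeeds with probability p ** n.
per_call_success = 0.97  # assumed, for illustration

for n in (5, 10, 20, 50):
    chain_success = per_call_success ** n
    print(f"{n:>2} calls: {chain_success:.2f}")
```

Even at 97% per call, a 20-call chain completes cleanly only about half the time, and a 50-call chain roughly one time in five.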
Mitigations: smaller, focused agents that hand off to others (multi-agent); intermediate verification steps; hard limits on iteration count.
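Two of those mitigations, the hard iteration limit and intermediate verification, can be sketched as a wrapper around each tool call. `verify` is a hypothetical checker: in practice it might be a cheaper model, a schema validator, or a test suite.

```python
class BudgetExceeded(Exception):
    """Raised when the agent hits its hard call limit."""

def guarded_step(run_tool, verify, action: str, budget: dict) -> str:
    # Hard limit on iteration count: fail loudly instead of looping forever.
    if budget["calls"] >= budget["max_calls"]:
        raise BudgetExceeded("hard iteration limit reached")
    budget["calls"] += 1
    result = run_tool(action)
    # Intermediate verification: catch a wrong call now, before the
    # error compounds through the rest of the chain.
    if not verify(action, result):
        raise ValueError(f"verification failed for {action!r}")
    return result
```

Failing fast at the bad step is the point: a verification error after call 3 is recoverable, while the same error silently carried through call 15 usually is not.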
The next breakthrough, if and when it comes, will likely be on this axis. Models that can plan, monitor, and replan reliably for hundreds of steps are the unlock for many real-world AI applications.