State Machines vs Goal-Driven Agents in SRE
Goal-driven agents are flexible but unpredictable. State machines are predictable but brittle. This piece covers the hybrid that production SRE teams actually ship, and how to choose among the three.
The two axes that pick the architecture
Two axes pick the architecture: how predictable the path is, and how broad the goal is. Predictable path + narrow goal = state machine. Unpredictable path + broad goal = goal-driven agent. The two mixed cells (a predictable path with a broad goal, or an unpredictable path with a narrow goal) are where most production work lives, and where the hybrid wins.
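The two-axis rule can be written down as a small lookup. This is a minimal sketch: the boolean inputs are a simplification (real workloads sit on a spectrum), and the names `Architecture` and `pick_architecture` are hypothetical.

```python
from enum import Enum


class Architecture(Enum):
    STATE_MACHINE = "state machine"
    GOAL_DRIVEN_AGENT = "goal-driven agent"
    HYBRID = "hybrid"


def pick_architecture(path_predictable: bool, goal_narrow: bool) -> Architecture:
    """Map the two axes to an architecture (illustrative rule of thumb only)."""
    if path_predictable and goal_narrow:
        return Architecture.STATE_MACHINE
    if not path_predictable and not goal_narrow:
        return Architecture.GOAL_DRIVEN_AGENT
    # Mixed cases: one axis says "state machine", the other says "agent".
    return Architecture.HYBRID
```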
Most SRE work has a predictable path for the common case and an unpredictable path for the long tail. A pure state machine handles the common case beautifully and falls over on the long tail. A pure goal-driven agent handles the long tail but burns budget on the common case.
The hybrid: the state machine handles the path it knows, with a single state called "unknown" that hands off to a goal-driven agent for the long tail. This pattern is what production-grade SRE platforms actually run.
When the state machine wins
If the runbook has fewer than 20 steps and a clear branching tree, encode it as a state machine. You get determinism, debuggability, low cost, and low latency. The model is invoked at decision points, not orchestration points.
State machines are the right shape for high-frequency, well-understood incidents: cert renewal, scheduled scaling, deploy promotion. The path is the same every time; the only variation is in the data.
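As an illustration, a cert-renewal runbook might be encoded as a transition table like the one below. This is a sketch under stated assumptions: the state names, the `ctx` keys (`days_left`, `renew_below_days`), and the renewal stub are all hypothetical, and a real handler would call your certificate tooling.

```python
from typing import Callable, Dict, List

# Each handler inspects the context and returns the name of the next state.
Handler = Callable[[dict], str]


def check_expiry(ctx: dict) -> str:
    # Decision point: a plain data check here; a model call could sit here instead.
    return "renew" if ctx["days_left"] < ctx["renew_below_days"] else "done"


def renew(ctx: dict) -> str:
    ctx["days_left"] = 90  # stand-in for a real renewal call (assumption)
    return "verify"


def verify(ctx: dict) -> str:
    return "done" if ctx["days_left"] > 0 else "failed"


STATES: Dict[str, Handler] = {
    "check_expiry": check_expiry,
    "renew": renew,
    "verify": verify,
}
TERMINAL = {"done", "failed"}


def run(ctx: dict, start: str = "check_expiry", max_steps: int = 20) -> List[str]:
    """Walk the table until a terminal state; return the trace for debugging."""
    state, trace, steps = start, [start], 0
    while state not in TERMINAL:
        if steps >= max_steps:
            raise RuntimeError(f"exceeded {max_steps} steps; trace={trace}")
        state = STATES[state](ctx)
        trace.append(state)
        steps += 1
    return trace
```

The path is fixed by the table; only the data in `ctx` varies between runs, and the returned trace shows exactly which states were visited.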
The cost of a state machine is rigidity. When reality changes, the state machine has to change too. This is fine for stable workflows; it is painful for evolving ones.
When the goal-driven agent wins
If the input space is open and the path is not known in advance, you need a goal-driven agent. Triage of an unfamiliar incident, exploratory cost analysis, and postmortem drafting have no single canonical path.
Goal-driven agents are also the right shape when the workflow is going to evolve fast. You change the goal; the agent adapts. With a state machine, you are rebuilding the FSM every quarter.
The cost of a goal-driven agent is unpredictability. The agent might wander. Good goal-driven agents compensate with hard limits on what they can do and for how long, eval suites, and human escalation paths.
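One way to bound a wandering agent is a step budget plus a tool allowlist, with escalation when either limit is hit. The sketch below assumes hypothetical callables: `propose_step` stands in for a model call that picks the next tool, and `is_done` stands in for whatever check decides the goal is met. The tool names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional


@dataclass
class AgentResult:
    status: str  # "resolved" or "escalated"
    steps: List[str] = field(default_factory=list)
    reason: Optional[str] = None


def run_bounded(
    goal: str,
    propose_step: Callable[[str, List[str]], str],  # stand-in for a model call
    is_done: Callable[[List[str]], bool],
    max_steps: int = 8,
    allowed_tools: FrozenSet[str] = frozenset({"query_metrics", "read_logs"}),
) -> AgentResult:
    steps: List[str] = []
    for _ in range(max_steps):
        if is_done(steps):
            return AgentResult("resolved", steps)
        tool = propose_step(goal, steps)
        if tool not in allowed_tools:
            # Out-of-bounds action: stop and hand off to a human.
            return AgentResult("escalated", steps, f"tool {tool!r} not allowed")
        steps.append(tool)
    if is_done(steps):
        return AgentResult("resolved", steps)
    return AgentResult("escalated", steps, "step budget exhausted")
```

Both escalation paths return the steps taken so far, so the human who picks up the incident can see where the agent got to.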
The hybrid pattern in code
The hybrid is an FSM with one extra state, called "agent" here (the "unknown" state described earlier), that delegates to a goal-driven agent when the FSM does not know what to do. The agent has access to the same tools as the FSM. When the agent finishes, it returns a state name, and the FSM resumes from there.
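A minimal sketch of that shape follows. The alert names, the handlers, and the `agent` callable's signature are assumptions made for illustration; in practice the agent would wrap a model-driven loop.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]
TERMINAL = {"done", "escalate"}


def classify(ctx: dict) -> str:
    # Known alerts map to known states; everything else goes to the agent.
    known = {"cert_expiring": "renew_cert", "disk_full": "expand_disk"}
    return known.get(ctx["alert"], "agent")


def renew_cert(ctx: dict) -> str:
    return "done"  # stand-in for the real remediation


def expand_disk(ctx: dict) -> str:
    return "done"  # stand-in for the real remediation


def make_hybrid(agent: Callable[[dict, List[str]], str]) -> Callable[[dict], List[str]]:
    handlers: Dict[str, Handler] = {
        "classify": classify,
        "renew_cert": renew_cert,
        "expand_disk": expand_disk,
    }

    def run(ctx: dict, max_steps: int = 10) -> List[str]:
        state, history = "classify", ["classify"]
        while state not in TERMINAL:
            if len(history) > max_steps:
                raise RuntimeError(f"exceeded {max_steps} steps; history={history}")
            if state == "agent":
                # The agent sees the context and history and returns a state name.
                state = agent(ctx, history)
            else:
                # An unknown state name raises KeyError here rather than being ignored.
                state = handlers[state](ctx)
            history.append(state)
        return history

    return run
```

For example, with a stub agent that always resumes at `expand_disk`, an unrecognized alert takes the path `classify`, `agent`, `expand_disk`, `done`, while a known alert never touches the agent.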
This shape gives you the FSM's predictability for roughly 80% of cases and the agent's flexibility for the remaining 20% or so. Cost stays low because the agent only fires when needed. Latency stays low because the FSM does not invoke the model unnecessarily.
The handoff from FSM to agent and back is where the bugs live. Be explicit about what the agent gets (current state, history) and what it can return (next state, intermediate results). Tight contracts; loud failures.
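The contract can be made explicit with typed request and response objects and a validator that raises on anything unexpected. The type names, fields, and `HandoffError` below are hypothetical; they mirror the four items named above (current state, history, next state, intermediate results).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class AgentRequest:
    """What the agent gets from the FSM."""
    current_state: str
    history: Tuple[str, ...]


@dataclass(frozen=True)
class AgentResponse:
    """What the agent may return to the FSM."""
    next_state: str
    intermediate_results: Dict[str, Any] = field(default_factory=dict)


class HandoffError(Exception):
    """Raised when the agent violates the handoff contract."""


def validate_response(resp: object, resumable_states: FrozenSet[str]) -> AgentResponse:
    if not isinstance(resp, AgentResponse):
        raise HandoffError(f"expected AgentResponse, got {type(resp).__name__}")
    if resp.next_state not in resumable_states:
        raise HandoffError(
            f"agent returned unknown state {resp.next_state!r}; "
            f"allowed: {sorted(resumable_states)}"
        )
    return resp
```

Failing at the boundary keeps a malformed agent reply from turning into a confusing error several states later.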
How the architecture evolves with the team
In year one, teams usually start with a goal-driven agent because it is more flexible. By year two, the patterns that show up frequently get promoted into FSM states. By year three, the system is an FSM with one agent state for the unknown.
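Spotting promotion candidates can be as simple as counting what the agent keeps handling. The sketch below assumes each agent run is logged with a `signature` field identifying the kind of incident; that field and both thresholds are hypothetical and would need tuning against your own data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def promotion_candidates(
    agent_runs: Iterable[dict],
    min_count: int = 3,
    min_share: float = 0.1,
) -> List[Tuple[str, int]]:
    """Return (signature, count) pairs frequent enough to consider as FSM states."""
    counts = Counter(run["signature"] for run in agent_runs)
    total = sum(counts.values())
    return [
        (sig, n)
        for sig, n in counts.most_common()
        if n >= min_count and n / total >= min_share
    ]
```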
This evolution is healthy. It means the team is learning the workload and codifying that knowledge. Resist the inverse path: starting with an FSM and adding a goal-driven escape hatch only when the FSM gets too painful.
Document the architecture decisions as ADRs. "Why is cert renewal an FSM and triage a goal-driven agent?" is a question your future engineers will ask, and the ADR is how you answer them in advance.
What to do this week
Look at your top 5 agent workflows by volume. For each, ask: is the path predictable enough to be an FSM? If the answer is yes for 3 or more of the 5, you are probably running goal-driven agents where state machines would be cheaper and more reliable. Convert one this quarter and measure the change in latency, cost, and reliability.
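The audit can be sketched in a few lines. The record fields (`name`, `runs_per_week`, `path_predictable`) and the metric keys are hypothetical, and `path_predictable` is a human judgment you record, not something the code infers.

```python
from typing import Dict, List, Tuple


def audit(workflows: List[dict], top_n: int = 5, threshold: int = 3) -> Tuple[List[str], bool]:
    """Return the FSM candidates among the top workflows and whether to act."""
    top = sorted(workflows, key=lambda w: w["runs_per_week"], reverse=True)[:top_n]
    candidates = [w["name"] for w in top if w["path_predictable"]]
    return candidates, len(candidates) >= threshold


def deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-metric change after converting one workflow (after minus before)."""
    return {metric: after[metric] - before[metric] for metric in before}
```

Record the same metrics before and after the conversion so the deltas compare like with like.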