Designing the Agent Loop for Production SRE
Observe → think → act is the textbook loop. Production needs five more nodes. This is the loop shape that handles real incidents without burning compute or making things worse.
The textbook loop is incomplete for SRE
Observe → think → act is the canonical agent loop. It is fine for chatbots; it is dangerous for SRE. Production needs three more node types: verify (did the action have the intended effect), bound (has the agent exceeded its budget or scope), and escalate (does it need a human).
Without verify, the agent acts blindly. The action might fail silently, succeed but miss the goal, or succeed and create a new problem. Verify is the cheapest insurance the loop can buy.
Without bound, the agent can spin. Without escalate, the agent fails alone. Both are required to make the loop production-grade rather than demo-grade.
Verify is more than a status check
After every action, the loop should check three things: did the action complete (status code or equivalent), did the metric the agent was trying to move actually move, and did any other metric move that should have stayed put.
The third check is the one most teams skip. It is the one that catches the side effect: the agent fixed the latency by killing the slow query, but accidentally also broke the dashboard that was reading the same connection pool. Side-effect detection is what verify is for.
Make verify cheap. A 5-second wait plus three metric reads is the right shape. Anything heavier and the agent feels slow; anything lighter and verify becomes ceremonial.
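The three checks can be sketched as one small function. This is a hypothetical shape, assuming metrics are readable through a `read_metric(name)` callable and that the agent snapshots a `baseline` of metric values before acting; all names here are illustrative, not from any specific library:

```python
import time

def verify(action_result, target_metric, guard_metrics, read_metric,
           baseline, settle_seconds=5):
    """Three checks after every action: completion, intended effect,
    and side effects on metrics that should not have moved."""
    checks = {}
    # 1. Did the action complete?
    checks["completed"] = action_result.get("status") == "ok"
    # Let the system settle briefly before reading metrics.
    time.sleep(settle_seconds)
    # 2. Did the target metric actually move? (Assumes lower is better,
    #    e.g. latency or error rate.)
    checks["target_moved"] = read_metric(target_metric) < baseline[target_metric]
    # 3. Did any guard metric move when it should have stayed put?
    tolerance = 0.10  # hypothetical 10% band around the baseline
    checks["no_side_effects"] = all(
        abs(read_metric(m) - baseline[m]) <= tolerance * baseline[m]
        for m in guard_metrics
    )
    return checks
```

The whole call is one short sleep plus a handful of metric reads, which keeps verify cheap enough to run after every action.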
Bounds keep stuck agents from spinning
Three bounds matter: action count (how many actions can the agent take in this run), token budget (how much can it spend on LLM calls), and wall-clock (how long can it run). Hit any one and the agent stops.
Set bounds tight at first. The first time a bound trips on a real run, you learn something about the workload. If bounds trip on legitimate runs, raise them. If they only trip on stuck runs, hold them tight.
Surface bound trips in observability. A run that hit its action cap is a run you should review. Fifty runs per week hitting the cap is a signal the agent is consistently underestimating scope.
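A minimal sketch of the three bounds as one object, assuming the loop charges it after each action and LLM call; the class and field names are illustrative. Returning *which* bound tripped is what makes the trip easy to surface in observability:

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunBounds:
    """Hypothetical per-run bounds; tripping any single one stops the agent."""
    max_actions: int = 10
    max_tokens: int = 50_000
    max_seconds: float = 300.0
    actions: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, actions=0, tokens=0):
        """Record spend after each action or LLM call."""
        self.actions += actions
        self.tokens += tokens

    def tripped(self):
        """Return the name of the first exceeded bound, or None.
        The name goes straight into run metadata for later review."""
        if self.actions >= self.max_actions:
            return "action_count"
        if self.tokens >= self.max_tokens:
            return "token_budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock"
        return None
```

Tagging each stopped run with the returned bound name is what lets you see patterns like fifty action-cap trips per week.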
Escalation is a first-class loop node
The loop should always have an exit to human. Stuck for too long: escalate. Confidence below threshold: escalate. Action would be irreversible without prior approval: escalate. Each is a node, not a special case.
Escalation is not failure; it is the agent doing its job. The metric to track is: of escalations, how many should have been handled by the agent (false escalations) and how many were correctly identified (true escalations). The first number should trend down as you improve the agent.
Escalation should include context: what the agent tried, what it observed, what its hypothesis is, where it is stuck. The human picks up where the agent left off, not from scratch. That handoff context is half the value of having an agent at all.
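The handoff packet can be as simple as a struct carrying those four things. A hypothetical shape (field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class Escalation:
    """Handoff context: the human resumes from here, not from scratch."""
    reason: str                  # e.g. "stuck", "low_confidence", "irreversible_action"
    actions_tried: list = field(default_factory=list)   # what the agent tried
    observations: list = field(default_factory=list)    # what it observed
    hypothesis: str = ""                                # its current theory
    stuck_on: str = ""                                  # where it is blocked

    def summary(self):
        """One-line summary for the page or ticket title."""
        return (
            f"escalation({self.reason}): tried {len(self.actions_tried)} action(s); "
            f"hypothesis: {self.hypothesis or 'none'}; "
            f"stuck on: {self.stuck_on or 'n/a'}"
        )
```

The full object goes into the ticket body; `summary()` gives the responder the shape of the problem before they open it.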
What the full loop looks like in code
while not done and not bound_exceeded(): observe → think → maybe_escalate → bound_check → act → verify → maybe_escalate → bound_check → repeat. Eight nodes, not three. The extra five are what separate production from demo.
Each node is a small function with explicit inputs and outputs. Each can be tested independently. Each can be replaced without touching the others. This shape is more code, but the code is simpler than the alternative.
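A sketch of that shape, assuming each node is an injected callable (a `bounds` object with hypothetical `charge`/`tripped` methods, a `think` that returns a plan dict, and so on); this is one possible wiring of the eight nodes, not a reference implementation:

```python
class Bounds:
    """Minimal action-count bound, just enough for the sketch."""
    def __init__(self, max_actions):
        self.max_actions, self.actions = max_actions, 0
    def charge(self, actions=0):
        self.actions += actions
    def tripped(self):
        return "action_count" if self.actions >= self.max_actions else None

def run_agent(observe, think, act, verify, escalate, bounds):
    """observe → think → maybe_escalate → bound_check → act → verify
    → maybe_escalate → bound_check. Returns (status, history)."""
    history = []
    while True:
        obs = observe()
        plan = think(obs, history)
        # maybe_escalate: think can ask for a human before acting
        if plan.get("escalate"):
            escalate(plan, history)
            return "escalated", history
        if plan.get("done"):
            return "done", history
        # bound_check before spending an action
        if bounds.tripped():
            escalate({"reason": bounds.tripped()}, history)
            return "bounded", history
        result = act(plan["action"])
        bounds.charge(actions=1)
        checks = verify(result, plan)
        history.append((plan["action"], result, checks))
        # maybe_escalate after verify: a failed check is grounds to hand off
        if not all(checks.values()):
            escalate({"reason": "verify_failed", "checks": checks}, history)
            return "escalated", history
        # bound_check again before the next iteration
        if bounds.tripped():
            escalate({"reason": bounds.tripped()}, history)
            return "bounded", history
```

Because every node is a parameter, each can be unit-tested with fakes and swapped without touching the loop body.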
Most teams that start with a 3-node loop end up adding the other five within a month. Skip the detour: build the 8-node loop on day one.
What to do this week
Map your current agent loop. Count the nodes. If there are fewer than 8, identify which of verify / bound / escalate are missing. Add the smallest version of each. Re-run the eval suite; cost should be similar, the action-count distribution should tighten, and your worst-case run should be dramatically less bad.