How to Write a Runbook an AI Agent Can Execute Without Breaking Prod
The difference between a runbook a human reads and a runbook an agent executes comes down to three structural rules. Miss them and the agent stops, or worse, guesses.
The three structural rules
Human runbooks assume the reader can improvise. Told to “check the logs,” the operator knows which logs, where to find them, and what “normal” looks like. Agent-executable runbooks cannot rely on that, so each one follows three rules:
- Every precondition is machine-checkable.
- Every step is idempotent: safe to retry and safe to re-run after a partial failure.
- Every step has an explicit exit condition and a halt point if the exit check fails.
Preconditions as code
Instead of “make sure the incident is a database slowness issue,” write:
precondition: db.primary.replication_lag_seconds > 30
precondition: db.primary.cpu_percent < 90
precondition: service.error_rate_5m < 0.2
Each is a check the agent can run, and each has a clear failure mode: if the precondition fails, the agent halts and hands off to a human with the exact line that failed.
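A minimal Python sketch of how an agent might evaluate those lines. The "metric operator threshold" syntax, the `read_metric` callback, and the sample readings are assumptions made for illustration, not part of any particular agent framework.

```python
import operator

# Comparison operators supported by this hypothetical precondition syntax.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}


class PreconditionFailed(Exception):
    """Raised on the first failing precondition; carries the exact line."""

    def __init__(self, line, observed):
        super().__init__(f"precondition failed: {line!r} (observed {observed})")
        self.line = line
        self.observed = observed


def parse(line):
    """Split 'metric op threshold' into a metric name, a comparator, a number."""
    metric, op, threshold = line.split()
    if op not in OPS:
        raise ValueError(f"unsupported operator in {line!r}")
    return metric, OPS[op], float(threshold)


def check_preconditions(lines, read_metric):
    """Evaluate preconditions in order and halt on the first one that fails."""
    for line in lines:
        metric, compare, threshold = parse(line)
        observed = read_metric(metric)
        if not compare(observed, threshold):
            raise PreconditionFailed(line, observed)


# Made-up readings standing in for a real metrics backend.
readings = {
    "db.primary.replication_lag_seconds": 45.0,
    "db.primary.cpu_percent": 95.0,
    "service.error_rate_5m": 0.05,
}
preconditions = [
    "db.primary.replication_lag_seconds > 30",
    "db.primary.cpu_percent < 90",
    "service.error_rate_5m < 0.2",
]

try:
    check_preconditions(preconditions, readings.__getitem__)
    failed_line = None
except PreconditionFailed as halt:
    failed_line = halt.line  # the exact line handed to the human
```

With these sample readings the CPU check fails, so the agent stops before touching anything and reports that one line.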
Steps that are idempotent
Agents retry. Steps must tolerate being run twice. “Create a failover replica” becomes “ensure a failover replica exists; if one exists matching this spec, do nothing.”
Every step declares two outputs: a success state (measurable, not vibes) and a failure signal that triggers the halt.
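The "ensure a failover replica exists" wording can be sketched as a check-then-act function. This is an illustration only: the in-memory `inventory` list stands in for whatever API actually lists replicas, and the spec fields are hypothetical.

```python
def ensure_replica(inventory, spec):
    """Create a replica only if none matching `spec` already exists.

    Returns "created" or "unchanged" so the caller gets a measurable outcome.
    """
    for replica in inventory:
        # A replica matches when every field in the spec has the same value.
        if all(replica.get(key) == value for key, value in spec.items()):
            return "unchanged"
    inventory.append(dict(spec))
    return "created"


# Hypothetical spec; field names and values are placeholders.
inventory = []
spec = {"role": "failover", "region": "region-a", "size": "large"}

first = ensure_replica(inventory, spec)   # nothing exists yet, so it creates
second = ensure_replica(inventory, spec)  # a retry finds the match and does nothing
```

Running the step twice leaves exactly one replica, which is the property that makes a retry safe.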
Exit conditions and halt points
Every step answers: how do I know it worked? And every runbook answers: under what conditions do I halt and page a human?
Common halt conditions include: more than 3 consecutive step retries, any change that would touch more than N resources, any action outside the declared blast radius, any step that fails its exit check twice.
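Two of those halt conditions, the resource limit and the repeated exit-check failure, might be enforced by a step runner along these lines. The function name, its parameters, and the limits are assumptions chosen for the sketch.

```python
class HaltAndPage(Exception):
    """Signals that the agent must stop and hand off to a human."""


def run_step(action, exit_check, resources_touched, max_resources,
             max_exit_failures=2):
    """Run one step; halt on a blast-radius breach or repeated exit failure.

    Returns the number of failed attempts before the exit check passed.
    """
    # Refuse to act at all if the change is larger than the declared limit.
    if resources_touched > max_resources:
        raise HaltAndPage(
            f"step would touch {resources_touched} resources, "
            f"limit is {max_resources}")
    failures = 0
    while True:
        action()  # must be idempotent, because it may run more than once
        if exit_check():
            return failures
        failures += 1
        if failures >= max_exit_failures:
            raise HaltAndPage(f"exit check failed {failures} times")


# A stand-in action whose exit check only passes after the second attempt.
state = {"attempts": 0}

def flaky_action():
    state["attempts"] += 1

def flaky_check():
    return state["attempts"] >= 2

failed_attempts = run_step(flaky_action, flaky_check,
                           resources_touched=1, max_resources=5)

# A step whose exit check never passes halts instead of looping forever.
try:
    run_step(lambda: None, lambda: False, resources_touched=1, max_resources=5)
    halted = False
except HaltAndPage:
    halted = True
```

The blast-radius check runs before the action, so an oversized change is rejected without any side effects.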
A template you can copy
id: db-failover-v3
blast_radius: one database cluster
preconditions:
  - db.primary.replication_lag_seconds > 30
steps:
  - name: confirm_standby_healthy
    check: db.standby.replication_lag_seconds < 5
    halt_on_fail: true
  - name: promote_standby
    action: rds.promote_read_replica(target=db.standby)
    idempotent: true
    exit_check: db.standby.role == "primary"
  - name: reroute_traffic
    action: route53.update(record=db.endpoint, value=db.standby.address)
    exit_check: dig +short db.endpoint == db.standby.address
halt_conditions:
  - any step fails exit_check twice
  - total runtime exceeds 300s
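Before handing a runbook like this to an agent, it may help to lint it against the three rules. The sketch below assumes the template has already been loaded into a Python dict (for example by a YAML parser); the specific checks are one possible reading of the rules, not a fixed schema.

```python
def validate_runbook(runbook):
    """Return a list of structural problems; an empty list means it passes."""
    problems = []
    if not runbook.get("blast_radius"):
        problems.append("missing blast_radius")
    if not runbook.get("preconditions"):
        problems.append("no preconditions declared")
    if not runbook.get("halt_conditions"):
        problems.append("no halt_conditions declared")
    for step in runbook.get("steps", []):
        name = step.get("name", "<unnamed>")
        # Any step that changes something must say how success is measured.
        if "action" in step and "exit_check" not in step:
            problems.append(f"step {name}: action without exit_check")
        if "action" not in step and "check" not in step:
            problems.append(f"step {name}: neither action nor check")
    return problems


# The template above, expressed as the dict a parser would produce.
runbook = {
    "id": "db-failover-v3",
    "blast_radius": "one database cluster",
    "preconditions": ["db.primary.replication_lag_seconds > 30"],
    "steps": [
        {"name": "confirm_standby_healthy",
         "check": "db.standby.replication_lag_seconds < 5",
         "halt_on_fail": True},
        {"name": "promote_standby",
         "action": "rds.promote_read_replica(target=db.standby)",
         "idempotent": True,
         "exit_check": 'db.standby.role == "primary"'},
        {"name": "reroute_traffic",
         "action": "route53.update(record=db.endpoint, value=db.standby.address)",
         "exit_check": "dig +short db.endpoint == db.standby.address"},
    ],
    "halt_conditions": ["any step fails exit_check twice",
                        "total runtime exceeds 300s"],
}

problems = validate_runbook(runbook)
```

A check like this can run in CI so a runbook missing an exit check never reaches the agent.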
Write three of your most-run runbooks this way this quarter. The first one takes a day; the third takes an hour.
Your first three runbooks
Pick the three most-run runbooks in your ops history. Not the most dramatic. The most frequent.
Rewrite each in the structured form: preconditions as code, steps with explicit idempotency, exit checks per step, blast radius declared at the top.
The first takes a day. The second takes half a day. The third takes an hour. By the fourth, the format is second nature and the agent can execute end-to-end with human approval only on high-blast-radius steps.