By Samson Tanimawo, PhD. Published Apr 30, 2025.

How to Write a Runbook an AI Agent Can Execute Without Breaking Prod

The difference between a runbook a human reads and a runbook an agent executes comes down to three structural rules. Miss them and the agent stops, or worse, guesses.

The three structural rules

Human runbooks assume the reader can improvise. Given “Check the logs,” the operator knows which logs, where, and what “normal” looks like. An agent has none of that context, so an agent-executable runbook has to spell it out.

Preconditions as code

Instead of “make sure the incident is a database slowness issue,” write:

precondition: db.primary.replication_lag_seconds > 30
precondition: db.primary.cpu_percent < 90
precondition: service.error_rate_5m < 0.2

Each is a check the agent can run, and each has a clear failure mode: if the precondition fails, the agent halts and hands off to a human with the exact line that failed.
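As a rough illustration of "a check the agent can run," here is a minimal Python sketch that evaluates precondition lines against a metric snapshot and halts with the exact line that failed. The `check_preconditions` function, the `PreconditionFailed` exception, and the snapshot values are assumptions made for this example, not part of any particular agent framework.

```python
import operator
import re

# Comparison operators allowed in a precondition line.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

# Matches lines of the form "<metric.name> <op> <number>".
LINE = re.compile(r"^\s*([\w.]+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


class PreconditionFailed(Exception):
    """Raised with the exact precondition line that did not hold."""


def check_preconditions(lines, metrics):
    """Evaluate each line in order; raise on the first one that fails."""
    for line in lines:
        match = LINE.match(line)
        if match is None:
            raise PreconditionFailed(f"unparseable: {line}")
        name, op, threshold = match.groups()
        if name not in metrics:
            # Missing data counts as a failure, never as a pass.
            raise PreconditionFailed(f"no data: {line}")
        if not OPS[op](metrics[name], float(threshold)):
            raise PreconditionFailed(f"failed: {line}")


preconditions = [
    "db.primary.replication_lag_seconds > 30",
    "db.primary.cpu_percent < 90",
    "service.error_rate_5m < 0.2",
]
# Hypothetical snapshot; a real agent would query its monitoring system.
snapshot = {
    "db.primary.replication_lag_seconds": 45.0,
    "db.primary.cpu_percent": 95.0,
    "service.error_rate_5m": 0.05,
}
try:
    check_preconditions(preconditions, snapshot)
except PreconditionFailed as halt:
    print(f"HALT, handing off to human: {halt}")
```

With this snapshot the CPU precondition fails, so the handoff message names that line and nothing else runs.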

Steps that are idempotent

Agents retry. Steps must tolerate being run twice. “Create a failover replica” becomes “ensure a failover replica exists; if one exists matching this spec, do nothing.”

Every step declares two outputs: a success state (measurable, not vibes) and a failure signal that triggers the halt.
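The "ensure" pattern can be sketched in a few lines of Python. Everything here is hypothetical: `FakeCluster` is an in-memory stand-in for a real control plane, and the spec values are placeholders. The point is only the shape: check first, act only if needed, then verify by reading state back.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for a database control plane."""
    replicas: dict = field(default_factory=dict)
    create_calls: int = 0

    def create_replica(self, name, spec):
        self.create_calls += 1
        self.replicas[name] = dict(spec)


@dataclass
class StepResult:
    success: bool  # measurable success state
    detail: str    # on failure, the signal that triggers the halt


def ensure_failover_replica(cluster, name, spec):
    """Create the replica only if one matching the spec is absent."""
    existing = cluster.replicas.get(name)
    if existing is not None and existing != spec:
        # A replica exists but differs: do not guess, signal failure.
        return StepResult(False, f"{name} exists with a different spec")
    if existing is None:
        cluster.create_replica(name, spec)
    # Verify by reading state back rather than trusting the call.
    ok = cluster.replicas.get(name) == spec
    return StepResult(ok, "replica matches spec" if ok else "replica missing after create")


cluster = FakeCluster()
spec = {"instance_class": "db.r6g.large", "region": "us-east-1"}  # placeholder values
first = ensure_failover_replica(cluster, "standby-1", spec)
second = ensure_failover_replica(cluster, "standby-1", spec)  # simulated agent retry
print(first.success, second.success, cluster.create_calls)
```

Both calls report success, but the create call runs only once, which is what makes the retry safe.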

Exit conditions and halt points

Every step answers: how do I know it worked? And every runbook answers: under what conditions do I halt and page a human?

Common halt conditions include more than 3 consecutive step retries, any change that would touch more than N resources, any action outside the declared blast radius, and any step that fails its exit check twice.
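One way to enforce halt conditions like these is to wrap every step in a guard. The sketch below covers three of them (blast radius, resource count, repeated exit-check failure). The step dictionary layout, `allowed_targets`, and `max_resources` are assumptions for illustration, and the retry loop is only safe because steps are assumed to be idempotent.

```python
class Halt(Exception):
    """Raised when the runbook must stop and page a human."""


def run_step(step, allowed_targets, max_resources, max_exit_failures=2):
    """Run one step; return the number of attempts used, or raise Halt."""
    targets = step["targets"]
    outside = [t for t in targets if t not in allowed_targets]
    if outside:
        raise Halt(f"{step['name']}: targets outside blast radius: {outside}")
    if len(targets) > max_resources:
        raise Halt(f"{step['name']}: would touch {len(targets)} resources, "
                   f"limit is {max_resources}")
    for attempt in range(1, max_exit_failures + 1):
        step["action"]()  # assumed idempotent, so a retry is safe
        if step["exit_check"]():
            return attempt
    raise Halt(f"{step['name']}: exit check failed {max_exit_failures} times")


state = {"attempts": 0}


def promote():
    # Stand-in action that only "takes effect" on the second call.
    state["attempts"] += 1


flaky = {
    "name": "promote_standby",
    "targets": ["db-cluster-a"],
    "action": promote,
    "exit_check": lambda: state["attempts"] >= 2,
}
print(run_step(flaky, {"db-cluster-a"}, max_resources=1))

# Same step pointed at a resource the runbook never declared.
stray = dict(flaky, name="reroute_traffic", targets=["db-cluster-b"])
try:
    run_step(stray, {"db-cluster-a"}, max_resources=1)
except Halt as halt:
    print(f"PAGE HUMAN: {halt}")
```

The blast-radius and resource checks run before the action, so a step that violates them never executes at all.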

A template you can copy

id: db-failover-v3
blast_radius: one database cluster
preconditions:
  # Run only when primary replication lag exceeds 30 seconds.
  - db.primary.replication_lag_seconds > 30
steps:
  # Read-only gate: halt rather than promote a standby that is itself behind.
  - name: confirm_standby_healthy
    check: db.standby.replication_lag_seconds < 5
    halt_on_fail: true
  - name: promote_standby
    action: rds.promote_read_replica(target=db.standby)
    idempotent: true
    exit_check: db.standby.role == "primary"
  - name: reroute_traffic
    action: route53.update(record=db.endpoint, value=db.standby.address)
    idempotent: true  # setting the record to the same value again changes nothing
    exit_check: dig +short db.endpoint == db.standby.address
halt_conditions:
  - any step fails exit_check twice
  - total runtime exceeds 300s

Write three of your most-run runbooks this way this quarter. The first one takes a day; the third takes an hour.

Your first three runbooks

Pick the three most-run runbooks in your ops history. Not the most dramatic. The most frequent.

Rewrite each in the structured form: preconditions as code, steps with explicit idempotency, exit checks per step, blast radius declared at the top.
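A small linter can catch a rewrite that skipped one of these elements before an agent ever sees it. This is a minimal sketch that assumes the runbook has already been loaded into a Python dict with the field names used in this article. The `lint_runbook` function and the draft runbook are hypothetical.

```python
def lint_runbook(runbook):
    """Return a list of structural problems; an empty list means it passes."""
    problems = []
    if not runbook.get("blast_radius"):
        problems.append("blast_radius is not declared")
    if not runbook.get("preconditions"):
        problems.append("no preconditions")
    for step in runbook.get("steps", []):
        name = step.get("name", "<unnamed>")
        if "action" in step:
            # Steps that change something need both declarations.
            if step.get("idempotent") is not True:
                problems.append(f"{name}: action not marked idempotent")
            if "exit_check" not in step:
                problems.append(f"{name}: action has no exit_check")
        elif "check" not in step:
            problems.append(f"{name}: has neither action nor check")
    if not runbook.get("halt_conditions"):
        problems.append("no halt_conditions")
    return problems


# Deliberately incomplete draft, as a first rewrite often is.
draft = {
    "id": "restart-worker-draft",
    "preconditions": ["queue.depth > 1000"],
    "steps": [
        {"name": "restart_worker", "action": "systemd.restart(unit=worker)"},
    ],
    "halt_conditions": ["any step fails exit_check twice"],
}
for problem in lint_runbook(draft):
    print(problem)
```

Running a check like this in CI keeps the format from drifting as more runbooks are converted.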

The first takes a day. The second takes half a day. The third takes an hour. By the fourth, the format is second nature and the agent can execute end-to-end with human approval only on high-blast-radius steps.