Agentic SRE Advanced
By Samson Tanimawo, PhD · Published Jul 19, 2026 · 5 min read

The Pre-Flight Check Pattern for Production Agents

Before any high-risk action, the agent runs a checklist: am I in the right environment? Do I have approval? Is the blast radius understood? This article puts that list in code.

Why pre-flight, every time

An action without a pre-flight is a bug waiting to happen. The agent's view of the world might be stale, the environment might have changed, the prerequisites might not be met. Pre-flight catches all three before the action fires.

The cost of a pre-flight is small (a few extra tool calls, a few seconds). The cost of a wrong action is large (an incident, a postmortem, eroded trust). The trade is one-sided.

Pre-flight is the cheapest safety mechanism the agent has. It does not require extra approvals, does not require a sandbox, does not slow down the team. It just runs before every action and refuses on failure.

What goes on the checklist

Environment check. Is the agent acting in the environment it thinks it is? Production vs staging vs dev. The check is a single tool call; the answer is a string. The agent refuses if the answer surprises it.

Authorisation check. Does the agent have permission to take this action? IAM identity, scoped role, expected to be able to do exactly this. If the check fails, surface the gap; do not retry.

Blast radius check. What does this action affect? If the affected set is larger than the agent expected, refuse. "Restart pod" should affect one pod; if the count comes back as 50, something is wrong.

Reversibility check. Is the action reversible? Some actions become irreversible only after they fire (deletes, terminates). Pre-flight surfaces the reversibility honestly so the human approver can weigh it.
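A minimal sketch of the four checks. The real versions would sit on top of tool calls (an environment lookup, an IAM query, a dry-run that counts affected resources); here those results are passed in as plain arguments, and the function names and the `IRREVERSIBLE` set are illustrative assumptions, not a real API.

```python
# Each check returns (status, reason). The tool wrappers that would
# supply the arguments (env lookup, IAM query, dry-run count) are
# assumed, not shown.

def check_environment(actual: str, expected: str):
    """Refuse if the agent is not in the environment it thinks it is."""
    if actual == expected:
        return ("pass", "")
    return ("fail", f"expected {expected}, got {actual}")

def check_authorisation(granted: set, required: str):
    """Surface the permission gap; do not retry on failure."""
    if required in granted:
        return ("pass", "")
    return ("fail", f"missing permission: {required}")

def check_blast_radius(affected: int, expected_max: int):
    """'Restart pod' should affect one pod; 50 means something is wrong."""
    if affected <= expected_max:
        return ("pass", "")
    return ("fail", f"{affected} resources affected, expected <= {expected_max}")

IRREVERSIBLE = {"delete", "terminate"}  # illustrative set

def check_reversibility(action: str):
    """Irreversible actions are surfaced, not blocked, so the human
    approver can weigh them."""
    if action in IRREVERSIBLE:
        return ("pass", f"warning: {action} is irreversible")
    return ("pass", "")
```

Note that reversibility is the odd one out: it never fails on its own, it only attaches a warning for the approver to read.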

Design the pre-flight as code

Pre-flight checks are not LLM calls; they are deterministic functions. Write them in Python (or whatever your agent's host language is). Each check is 5-20 lines; the whole pre-flight is under 200 lines.

Each check returns a structured result: pass / fail / unknown. "Unknown" is treated as fail by default. The aggregate is pass-only-if-all-pass; one fail aborts the action.
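One way to encode that tri-state and the all-must-pass rule, as a sketch (the `Verdict` name is an assumption of this example):

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"  # could not determine; treated as FAIL by default

def aggregate(results: dict) -> bool:
    # Pass only if every check passed; a single FAIL or UNKNOWN
    # aborts the action.
    return all(v is Verdict.PASS for v in results.values())
```

For example, `aggregate({"environment": Verdict.PASS, "authorisation": Verdict.UNKNOWN})` is `False`: the agent could not confirm its permissions, so the action does not fire.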

Checks are independent. Each can be tested in isolation. The pre-flight harness runs them in parallel where possible, sequentially where one depends on another.
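A sketch of that harness, assuming each check is a zero-argument callable returning True/False (a simplification of the structured results above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_preflight(independent_checks: dict, dependent_chain=()):
    """Run independent checks in parallel; run the dependent chain in
    order, stopping at the first failure since later checks rely on
    earlier ones."""
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn)
                   for name, fn in independent_checks.items()}
        for name, fut in futures.items():
            results[name] = fut.result()
    for name, fn in dependent_chain:
        ok = fn()
        results[name] = ok
        if not ok:
            break  # skip checks that depend on this one
    return results
```

Because each check is just a callable, testing one in isolation is a one-line call with stubbed inputs.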

Integrate into the loop

The loop calls preflight() before every act() call. preflight() returns a result object that act() consumes. If preflight fails, act() does not fire; the loop logs the refusal and either escalates or stops.
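A sketch of that wiring, with the escalation hook and check signatures as assumptions of this example:

```python
import logging

log = logging.getLogger("agent")

def preflight(action, checks):
    """Run every check against the action; return (ok, failures)
    for act() to consume."""
    failures = [name for name, fn in checks.items() if not fn(action)]
    return (len(failures) == 0, failures)

def act(action, checks, execute, escalate):
    ok, failures = preflight(action, checks)
    if not ok:
        # The action does not fire; the refusal is logged and escalated.
        log.warning("preflight refused %s: %s", action, failures)
        escalate(action, failures)
        return None
    return execute(action)
```

The key property: `execute` is only reachable through the pre-flight gate, so there is no code path where the action fires unchecked.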

Pre-flight failures are observable. Each failure has a reason; reasons aggregate into a dashboard. The most common failure modes get fixed first; the rare ones serve as the safety net they were designed to be.

Pre-flight is not a one-time setup; it is a continuous discipline. New tools get new pre-flight checks. Removed tools get their checks removed. Treat it as part of the tool's contract.

When pre-flight fails, then what

Default behaviour: log the failure, abort the action, escalate to a human. The human reads the failure reason and decides whether to override. Override is itself a tracked event.

Some failures are recoverable. Stale environment data: refresh and retry pre-flight. Transient auth issue: retry with backoff. Define the recoverable failures explicitly; everything else is a hard stop.
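One way to make that allow-list explicit, as a sketch (the failure-reason strings and retry parameters are illustrative assumptions):

```python
import time

# Explicit allow-list: only these failure reasons are retried.
# Everything else is a hard stop.
RECOVERABLE = {"stale_environment", "transient_auth"}

def run_with_recovery(preflight, max_retries=3, base_delay=0.5):
    """Retry pre-flight for explicitly recoverable failures, with
    exponential backoff. preflight() returns (ok, reason)."""
    for attempt in range(max_retries + 1):
        ok, reason = preflight()
        if ok:
            return True
        if reason not in RECOVERABLE or attempt == max_retries:
            return False  # hard stop: abort and escalate
        time.sleep(base_delay * (2 ** attempt))
    return False
```

Keeping the allow-list as data rather than scattered conditionals makes the recoverable set reviewable in one place.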

Track override frequency. If humans are overriding pre-flight failures often, the pre-flight is too strict; tune it. If humans never override, the pre-flight is well-calibrated.

What to do this week

Pick the riskiest action your agent can take. Add the four-check pre-flight from this article. Run it for a week in parallel (logging only, not enforcing). Look at the failures: would they have caught real problems, would they have blocked real work? Tune, then enforce.
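The parallel-run week can be a single flag. A sketch, assuming the pre-flight returns (ok, reasons) as above; the `guarded` wrapper name is an assumption of this example:

```python
import logging

log = logging.getLogger("preflight")

def guarded(action_fn, preflight_fn, enforce=False):
    """Shadow mode: with enforce=False the pre-flight only logs its
    verdict and never blocks; flip enforce=True after the week of
    tuning described above."""
    def run(*args, **kwargs):
        ok, reasons = preflight_fn(*args, **kwargs)
        if not ok:
            log.warning("preflight would refuse: %s", reasons)
            if enforce:
                return None  # enforcing: the action does not fire
        return action_fn(*args, **kwargs)
    return run
```

The logged refusals from shadow mode are exactly the dataset you review before enforcing: each one is either a real problem caught or real work blocked.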