Agentic SRE Advanced
By Samson Tanimawo, PhD · Published Jul 22, 2026

The Agent Skeleton You Should Steal for Your First SRE Agent

A 200-line Python skeleton with the agent loop, tool registry, eval harness, and observability hooks already wired. What to keep, what to swap, and where to extend.

What's in the skeleton

A 200-line Python file with: an Agent class with the loop, a ToolRegistry with the three starter tools, an Eval harness that runs against a YAML test suite, and an observability module that logs each step in structured JSON.

Nothing exotic. No new framework. No bespoke abstractions. Just the parts every production agent needs, wired up so a new engineer can read it in one sitting.
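If you are writing your own version, the top level can be this small. A minimal sketch follows; every name in it (Agent, ToolRegistry, the field choices) is an illustrative assumption, not a prescribed API:

```python
# skeleton.py -- illustrative top-level layout, not the canonical file.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolRegistry:
    """Maps tool names to callables; the loop dispatches through this."""
    tools: dict[str, Callable[..., dict[str, Any]]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., dict[str, Any]]) -> None:
        self.tools[name] = fn

    def dispatch(self, name: str, **kwargs: Any) -> dict[str, Any]:
        return self.tools[name](**kwargs)


@dataclass
class Agent:
    """Owns the loop; holds the registry, the budget, and working memory."""
    registry: ToolRegistry
    max_steps: int = 10
    max_cost_usd: float = 1.00
    memory: list[dict[str, Any]] = field(default_factory=list)
```

The eval harness and the observability module live alongside these in the same file; they show up below.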

The skeleton is a starting point, not a destination. Most teams replace half of it within three months as their needs become specific. That is fine; the skeleton has earned its keep by getting them past day one.

The loop, in code

The loop is 30 lines: while not done and within budget, call the model, parse the structured output, dispatch tool calls, append results to working memory, repeat. The skeleton has explicit nodes for verify, bound-check, and escalate.
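A sketch of that loop as the run method on the Agent above. The helper methods (call_model, parse_output, verify, bound_check, escalate, finish) and the shape of the parsed model output are assumptions; they stand in for your model client and your policies:

```python
def run(self, alert: dict) -> "RunResult":
    """The loop: call the model, parse, dispatch, append, repeat."""
    self.memory.append({"role": "input", "content": alert})
    cost = 0.0
    for step in range(self.max_steps):
        response = self.call_model(self.memory)   # provider-specific client
        parsed = self.parse_output(response)      # structured output contract
        cost += parsed.cost_usd
        if cost > self.max_cost_usd:
            return self.escalate("cost budget exceeded")      # explicit node
        if parsed.done:
            if self.verify(parsed.hypothesis):                 # explicit node
                return self.finish(parsed)
            self.memory.append({"role": "verifier",
                                "content": "verification failed"})
            continue
        for call in parsed.tool_calls:
            if not self.bound_check(call):                     # explicit node
                return self.escalate(f"out-of-bounds call: {call.name}")
            result = self.registry.dispatch(call.name, **call.args)
            self.memory.append({"role": "tool", "name": call.name,
                                "content": result})
    return self.escalate("step budget exhausted")
```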

Each node is a method with a typed signature. They can be replaced individually. The skeleton's surface area is small; the customisation surface is large.

The loop returns a structured result object: the final hypothesis, the actions taken, the evidence gathered, the cost, the latency. This object is what gets logged and what feeds the eval harness.
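A plausible shape for that object; the field names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    hypothesis: str        # the agent's final diagnosis
    actions: list[dict]    # tool calls made, in order
    evidence: list[dict]   # tool results that support the hypothesis
    cost_usd: float        # summed model and tool cost
    latency_ms: int        # wall-clock time for the whole run
    escalated: bool = False  # True if the loop bailed out to a human
```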

The starter tools

MetricQuery: a wrapper around your metrics backend. Takes a metric name, a service tag, a time window. Returns a small JSON object. Tightly scoped; raises if the request is too broad.
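A sketch of that scoping, assuming a hypothetical query_backend client and an arbitrary one-hour cap:

```python
from datetime import timedelta

MAX_WINDOW = timedelta(hours=1)  # example bound; tune to your backend


def metric_query(metric: str, service: str, window: timedelta) -> dict:
    """Tightly scoped metrics read; raises rather than returning a huge blob."""
    if window > MAX_WINDOW:
        raise ValueError(f"window {window} exceeds cap {MAX_WINDOW}")
    series = query_backend(metric, {"service": service}, window)
    return {"metric": metric, "service": service, "points": series[-60:]}


def query_backend(metric, tags, window):
    """Stub: replace with your metrics client (Prometheus, Datadog, ...)."""
    raise NotImplementedError
```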

RecentEvents: returns the last N deploys, config changes, or feature-flag flips for a service. Sorted by timestamp; bounded to the last 24 hours by default.

LogSearch: a wrapper around your logs backend. Takes a query, a service tag, a time window. Returns the top matching lines, with a hard cap on output size. The cap is the most important feature.
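The cap, sketched. The limits and search_logs_backend are assumptions; the point is that truncation happens inside the tool, before anything reaches the model:

```python
MAX_LINES = 50       # hard cap on returned lines
MAX_LINE_LEN = 500   # truncate pathological lines too


def log_search(query: str, service: str, window_minutes: int = 60) -> dict:
    """Bounded log read: the cap is what keeps the context window sane."""
    lines = search_logs_backend(query, service, window_minutes)
    capped = [line[:MAX_LINE_LEN] for line in lines[:MAX_LINES]]
    return {"query": query, "matches": capped,
            "truncated": len(lines) > MAX_LINES}


def search_logs_backend(query, service, window_minutes):
    """Stub: replace with your logs client (Loki, Elasticsearch, ...)."""
    raise NotImplementedError
```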

The eval harness

A YAML file with test cases. Each case has an input (alert payload, metrics window) and an expected output (hypothesis, actions). The harness runs the agent against each case and reports pass/fail plus deltas.
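One case might look like this; the field names and the incident are illustrative, not a fixed schema:

```yaml
- id: checkout-latency-example
  input:
    alert: "p99 latency > 2s on checkout-api"
    metrics_window: "30m"
  expected:
    hypothesis: "recent deploy regressed the payment-client timeout"
    actions:
      - tool: RecentEvents
      - tool: MetricQuery
```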

The harness is run on every PR that touches the agent. CI fails if the eval suite regresses by more than the configured tolerance. This is the only way to keep prompt engineering disciplined.
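The gate itself can be a few lines in CI. This sketch assumes the harness writes a pass rate to a results file and that a baseline file is checked in; both names are assumptions:

```python
import json
import sys

TOLERANCE = 0.02  # configured regression tolerance (example value)


def gate(results_path: str, baseline_path: str) -> None:
    """Fail the build if the eval pass rate regresses past tolerance."""
    current = json.load(open(results_path))["pass_rate"]
    baseline = json.load(open(baseline_path))["pass_rate"]
    if current < baseline - TOLERANCE:
        sys.exit(f"eval regressed: {current:.2%} vs baseline {baseline:.2%}")


if __name__ == "__main__":
    gate("eval_results.json", "eval_baseline.json")
```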

Cases are added over time, especially after every incident the agent handled badly. The harness grows with the agent's deployment history.

The observability module

Every step in the loop emits a structured log line: timestamp, run_id, step_index, action, tool_name, latency_ms, tokens_in, tokens_out, cost_usd. The log line is the unit of observability.
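Emitting that line is one json.dumps per step; this sketch mirrors the field list above, and printing to stdout assumes your log shipper picks it up from there:

```python
import json
import time
from typing import Optional


def log_step(run_id: str, step_index: int, action: str,
             tool_name: Optional[str], latency_ms: int,
             tokens_in: int, tokens_out: int, cost_usd: float) -> None:
    """One structured line per loop step; the shipper does the rest."""
    print(json.dumps({
        "timestamp": time.time(),
        "run_id": run_id,
        "step_index": step_index,
        "action": action,        # e.g. "tool_call", "verify", "escalate"
        "tool_name": tool_name,
        "latency_ms": latency_ms,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
    }))
```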

The logs feed three dashboards: cost, latency, and correctness. The correctness dashboard pairs agent output with human-validated ground truth where available.

Logging is opinionated and minimal. There is no dump-the-prompt mode by default; that is a debug feature you enable on demand. Logs are tactical, not exhaustive.

What to do this week

Clone the skeleton (or write your own version of it from this article). Replace the placeholder tools with your own. Add five test cases from your real incident history. Run the eval. The first eval failure is a PR; the first eval pass is a deploy.