Instrumenting Your SRE Agent: What to Log

Token usage. Tool calls. Decisions. Failures. The structured log schema that makes debugging tractable, with the field-by-field rationale.

The log schema

The log schema is fixed and enforced. Every step emits a structured log: timestamp, run_id, step_index, agent_role, action_type, action_args, tool_name, tool_args, model_name, tokens_in, tokens_out, latency_ms, cost_usd, status. The schema is enforced by a validator and logs that do not match are dropped or repaired; add fields conservatively because each field is a maintenance burden.

14-field structured log. Timestamp through status; the canonical schema.
Validator-enforced. Logs that don’t match are dropped or repaired; schema does not bend.
Conservative addition. Each field is maintenance burden; only add when a known consumer needs it.
Per-field documented purpose. Every field has a documented consumer; supports continued discipline.

Cardinality discipline

Cardinality discipline keeps the bill in check. Run_id is high cardinality (one per run, fine); tool_name is low cardinality (a handful per agent, fine); action_args if logged raw can explode cardinality. Hash high-cardinality string fields when used as labels and keep the raw value in the body for debugging; watch the cardinality of the labels you use in dashboards because blown-up cardinality breaks the dashboard and your bill.

Run_id and tool_name fine. High and low cardinality respectively; both natural.
Action_args risk. Raw values explode cardinality; hash for labels.
Hash for labels, raw for body. Hash supports query cardinality control; raw supports debugging.
Per-dashboard cardinality watch. Blown cardinality breaks dashboard and bill.

Retention policy

Three retention tiers cover the lifecycle. Hot logs (last 7 days) carry full detail and are queryable from the dashboard; warm logs (7-90 days) are downsampled and sometimes anonymised, queryable but slower; cold logs (90+ days) are archived to object storage and used only for compliance audits or deep historical analysis.

Hot: 0-7 days. Full detail; queryable from dashboard; the active surface.
Warm: 7-90 days. Downsampled, sometimes anonymised; queryable but slower.
Cold: 90+ days. Archived to object storage; compliance audits and deep history.
Per-tier query cost. Hot fast and cheap; cold slow and expensive; matches usage patterns.

What to redact

Three categories of data deserve redaction. Customer identifiers (hash before logging); internal hostnames (hash if they identify private services); secrets (never log). Compliance frameworks (SOC2, HIPAA, PCI) constrain what can be logged so build the redaction layer to be pluggable because redactors are frequently updated; test redaction periodically by auditing logs for fields that should have been redacted.

Customer IDs hashed. Before logging; the standard PII protection.
Internal hostnames hashed. If they identify private services; protect topology.
Secrets never logged. No exception; the absolute rule.
Pluggable redactors. Compliance frameworks update; the layer must too.

Common queries you should optimise for

Three queries dominate agent debugging. “All steps in this run” needs index on run_id; “cost by agent role over the last day” needs aggregate index on agent_role plus timestamp; “tool call failures by tool name” needs index on tool_name plus status, which is the single most useful query for debugging agent behaviour.

All steps in run. Index on run_id; the basic replay query.
Cost by agent role over time. Aggregate index on agent_role + timestamp; the cost view.
Tool failures by tool name. Index on tool_name + status; the most useful debugging query.
Per-query index design. Each common query has a supporting index; supports performant ad-hoc analysis.