Instrumenting Your SRE Agent: What to Log

Token usage. Tool calls. Decisions. Failures. The structured log schema that makes debugging tractable, with the field-by-field rationale.

The log schema

The log schema is fixed and enforced. Every step emits a structured log: timestamp, run_id, step_index, agent_role, action_type, action_args, tool_name, tool_args, model_name, tokens_in, tokens_out, latency_ms, cost_usd, status. The schema is enforced by a validator and logs that do not match are dropped or repaired; add fields conservatively because each field is a maintenance burden.

Cardinality discipline

Cardinality discipline keeps the bill in check. Run_id is high cardinality (one per run, fine); tool_name is low cardinality (a handful per agent, fine); action_args if logged raw can explode cardinality. Hash high-cardinality string fields when used as labels and keep the raw value in the body for debugging; watch the cardinality of the labels you use in dashboards because blown-up cardinality breaks the dashboard and your bill.

Retention policy

Three retention tiers cover the lifecycle. Hot logs (last 7 days) carry full detail and are queryable from the dashboard; warm logs (7-90 days) are downsampled and sometimes anonymised, queryable but slower; cold logs (90+ days) are archived to object storage and used only for compliance audits or deep historical analysis.

What to redact

Three categories of data deserve redaction. Customer identifiers (hash before logging); internal hostnames (hash if they identify private services); secrets (never log). Compliance frameworks (SOC2, HIPAA, PCI) constrain what can be logged so build the redaction layer to be pluggable because redactors are frequently updated; test redaction periodically by auditing logs for fields that should have been redacted.

Common queries you should optimise for

Three queries dominate agent debugging. “All steps in this run” needs index on run_id; “cost by agent role over the last day” needs aggregate index on agent_role plus timestamp; “tool call failures by tool name” needs index on tool_name plus status, which is the single most useful query for debugging agent behaviour.