Agentic SRE Advanced By Samson Tanimawo, PhD Published Jun 16, 2026 5 min read

Instrumenting Your SRE Agent: What to Log

Token usage. Tool calls. Decisions. Failures. The structured log schema that makes debugging tractable, with the field-by-field rationale.

The log schema

Every step emits a structured log: timestamp, run_id, step_index, agent_role, action_type, action_args, tool_name, tool_args, model_name, tokens_in, tokens_out, latency_ms, cost_usd, status.

The schema is enforced by a validator. Logs that do not match are dropped or repaired; the schema does not bend.

Add fields conservatively. Each field is a maintenance burden; only add when a known consumer needs it.

Cardinality discipline

Run_id is high cardinality (one per run); fine. Tool_name is low cardinality (a handful per agent); fine. Action_args, if logged raw, can explode cardinality.

Hash high-cardinality string fields when used as labels. Keep the raw value in the body for debugging.

Watch the cardinality of the labels you use in dashboards. A blown-up cardinality breaks the dashboard and your bill.

Retention policy

Hot logs (last 7 days): full detail, queryable from the dashboard.

Warm logs (7-90 days): downsampled, sometimes anonymised. Queryable but slower.

Cold logs (90+ days): archived to object storage. Used only for compliance audits or deep historical analysis.

What to redact

Customer identifiers: hash before logging. Internal hostnames: hash if they identify private services. Secrets: never log.

Compliance frameworks (SOC2, HIPAA, PCI) constrain what can be logged. Build the redaction layer to be pluggable; redactors are a frequently-updated piece.

Test redaction. Periodically audit logs for fields that should have been redacted; fix the redactor when leaks are found.

Common queries you should optimise for

"All steps in this run": index on run_id.

"Cost by agent role over the last day": aggregate index on agent_role + timestamp.

"Tool call failures by tool name": index on tool_name + status. The single most useful query for debugging agent behaviour.