Instrumenting Your SRE Agent: What to Log
Token usage. Tool calls. Decisions. Failures. The structured log schema that makes debugging tractable, with the field-by-field rationale.
The log schema
Every step emits a structured log: timestamp, run_id, step_index, agent_role, action_type, action_args, tool_name, tool_args, model_name, tokens_in, tokens_out, latency_ms, cost_usd, status.
The schema is enforced by a validator. Logs that do not match are dropped or repaired; the schema does not bend.
Add fields conservatively. Each field is a maintenance burden; only add when a known consumer needs it.
Cardinality discipline
Run_id is high cardinality (one per run); fine. Tool_name is low cardinality (a handful per agent); fine. Action_args, if logged raw, can explode cardinality.
Hash high-cardinality string fields when used as labels. Keep the raw value in the body for debugging.
Watch the cardinality of the labels you use in dashboards. A blown-up cardinality breaks the dashboard and your bill.
Retention policy
Hot logs (last 7 days): full detail, queryable from the dashboard.
Warm logs (7-90 days): downsampled, sometimes anonymised. Queryable but slower.
Cold logs (90+ days): archived to object storage. Used only for compliance audits or deep historical analysis.
What to redact
Customer identifiers: hash before logging. Internal hostnames: hash if they identify private services. Secrets: never log.
Compliance frameworks (SOC2, HIPAA, PCI) constrain what can be logged. Build the redaction layer to be pluggable; redactors are a frequently-updated piece.
Test redaction. Periodically audit logs for fields that should have been redacted; fix the redactor when leaks are found.
Common queries you should optimise for
"All steps in this run": index on run_id.
"Cost by agent role over the last day": aggregate index on agent_role + timestamp.
"Tool call failures by tool name": index on tool_name + status. The single most useful query for debugging agent behaviour.