Agentic SRE Advanced By Samson Tanimawo, PhD Published Mar 16, 2026 5 min read

CloudTrail-Driven Triage: An Agent Pattern

Most cloud incidents have a CloudTrail event you missed. This pattern describes an agent that walks the trail, builds the causal chain, and writes the explanation.

Walk the trail

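By default, CloudTrail records management API calls to AWS; data events such as S3 object reads and Lambda invocations are recorded only when a trail is configured to log them. For most cloud incidents, the cause is in the trail; the question is finding it.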

The agent walks back from the affected resource: who called what, when, what changed.

Walk depth: typically 1 hour back. Most causes appear in that window; deeper walks rarely add value and multiply the token cost.
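The walk described above can be sketched as a time-window filter over events that touched the affected resource. This is a minimal sketch, not a definitive implementation: the event dictionaries are assumed to carry the keys EventId, EventName, EventTime, Username, and Resources, and the names walk_back, summarize, and sample_events are hypothetical.

```python
from datetime import datetime, timedelta, timezone

WALK_DEPTH = timedelta(hours=1)  # "typically 1 hour back"


def walk_back(events, resource_name, symptom_time, depth=WALK_DEPTH):
    """Return events that touched resource_name within the walk window, newest first."""
    start = symptom_time - depth
    hits = [
        e for e in events
        if start <= e["EventTime"] <= symptom_time
        and any(r.get("ResourceName") == resource_name for r in e.get("Resources", []))
    ]
    return sorted(hits, key=lambda e: e["EventTime"], reverse=True)


def summarize(event):
    """Answer 'who called what, when' for one event."""
    who = event.get("Username", "unknown")
    return f"{event['EventTime']:%H:%M} {who} called {event['EventName']}"


# Hypothetical sample data, for illustration only.
symptom = datetime(2026, 3, 16, 14, 0, tzinfo=timezone.utc)
sample_events = [
    {"EventId": "e1", "EventName": "PutRolePolicy", "Username": "deploy-bot",
     "EventTime": symptom - timedelta(minutes=18),
     "Resources": [{"ResourceName": "checkout-role"}]},
    {"EventId": "e2", "EventName": "PutRolePolicy", "Username": "alice",
     "EventTime": symptom - timedelta(hours=3),
     "Resources": [{"ResourceName": "checkout-role"}]},
]

for event in walk_back(sample_events, "checkout-role", symptom):
    print(summarize(event))
```

The event three hours before the symptom falls outside the window and is dropped, which is the trade-off the walk depth makes: fewer events for the agent to read, at the risk of missing a slow-burning cause.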

Building the causal chain

Start: the symptom (e.g., service was down at 14:00).

Layer 1: events on the service's resources in the 30 minutes before the symptom.

Layer 2: events on dependencies of those resources in the 60 minutes before the symptom.

The chain is built outward from the symptom, one layer at a time. Each layer narrows the candidate causes.
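The two layers can be sketched as two window filters over the same event list. This is a sketch under stated assumptions: the dependency map is hypothetical and supplied by the caller (CloudTrail does not provide one), the event shape is assumed as EventId, EventTime, and Resources, and build_chain and touches are illustrative names.

```python
from datetime import datetime, timedelta, timezone

LAYER1_WINDOW = timedelta(minutes=30)  # events on the service's own resources
LAYER2_WINDOW = timedelta(minutes=60)  # events on dependencies of those resources


def touches(event, names):
    """True if the event references any resource name in `names`."""
    return any(r.get("ResourceName") in names for r in event.get("Resources", []))


def build_chain(events, symptom_time, service_resources, dependencies):
    """Group candidate events into layers, starting from the symptom.

    `dependencies` maps a resource name to the set of resource names it
    depends on (hypothetical input, maintained outside CloudTrail).
    """
    def in_window(event, window):
        return symptom_time - window <= event["EventTime"] <= symptom_time

    own = set(service_resources)
    deps = set()
    for name in own:
        deps |= dependencies.get(name, set())
    deps -= own  # a resource is not its own dependency layer

    def by_time(event):
        return event["EventTime"]

    layer1 = sorted((e for e in events if in_window(e, LAYER1_WINDOW) and touches(e, own)), key=by_time)
    layer2 = sorted((e for e in events if in_window(e, LAYER2_WINDOW) and touches(e, deps)), key=by_time)
    return {"symptom_time": symptom_time, "layer1": layer1, "layer2": layer2}


# Hypothetical sample data, for illustration only.
t0 = datetime(2026, 3, 16, 14, 0, tzinfo=timezone.utc)


def make_event(event_id, minutes_before, resource):
    return {"EventId": event_id, "EventTime": t0 - timedelta(minutes=minutes_before),
            "Resources": [{"ResourceName": resource}]}


chain_events = [
    make_event("e1", 10, "checkout-fn"),    # layer 1: service resource, inside 30 min
    make_event("e2", 45, "checkout-role"),  # layer 2: dependency, inside 60 min
    make_event("e3", 45, "checkout-fn"),    # service resource, outside 30 min
    make_event("e4", 90, "checkout-role"),  # dependency, outside 60 min
]
chain = build_chain(chain_events, t0, ["checkout-fn"], {"checkout-fn": {"checkout-role"}})
```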

Writing the explanation

Two paragraphs. Paragraph 1: what changed. Paragraph 2: why that caused the symptom.

Avoid speculation. The agent only writes facts that are in the trail; if the trail does not show it, the agent does not claim it.

Confidence per claim. "The IAM policy was modified at 13:42 (high confidence). This caused the failed authentications at 13:50 (medium confidence)."
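One way to enforce the rules above is to make every claim carry its confidence and the events that support it, and to refuse to render a claim with no supporting event. This is an illustrative sketch; Claim and render_explanation are hypothetical names, and the event IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str          # one factual statement, without trailing punctuation
    confidence: str    # "high", "medium", or "low"
    event_ids: tuple   # CloudTrail event IDs that support the statement


def render_explanation(what_changed, why_it_caused):
    """Render two paragraphs: what changed, then why that caused the symptom."""
    def paragraph(claims):
        unsupported = [c.text for c in claims if not c.event_ids]
        if unsupported:
            # If the trail does not show it, the agent does not claim it.
            raise ValueError(f"claims without supporting events: {unsupported}")
        return " ".join(f"{c.text} ({c.confidence} confidence)." for c in claims)

    return paragraph(what_changed) + "\n\n" + paragraph(why_it_caused)


explanation = render_explanation(
    [Claim("The IAM policy was modified at 13:42", "high", ("evt-1",))],
    [Claim("This caused the failed authentications at 13:50", "medium", ("evt-1", "evt-2"))],
)
print(explanation)
```

Raising on an unsupported claim is a deliberate design choice in this sketch: it turns "avoid speculation" from a prompt instruction into a check the output must pass.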

What the trail cannot show

Application bugs: code changes that did not touch AWS resources. CloudTrail does not see git.

Network problems: most network issues, such as packet loss or latency, do not generate CloudTrail events. They need a different signal source, for example VPC Flow Logs.

Lambda runtime errors: visible in CloudWatch Logs, not CloudTrail. The agent reads both and combines them.
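Combining the two sources can be as simple as normalizing both into one time-ordered list. The sketch below assumes CloudTrail events carry a timezone-aware EventTime and an EventName, and that CloudWatch log events carry a timestamp in milliseconds since the epoch and a message; merged_timeline is a hypothetical helper.

```python
from datetime import datetime, timezone


def merged_timeline(trail_events, log_events):
    """Merge CloudTrail events and CloudWatch log events into one sorted timeline.

    Each row is (time, source, summary) so the agent can see API changes
    and runtime errors side by side.
    """
    rows = [(e["EventTime"], "cloudtrail", e["EventName"]) for e in trail_events]
    rows += [
        (datetime.fromtimestamp(entry["timestamp"] / 1000, tz=timezone.utc),
         "cloudwatch", entry["message"])
        for entry in log_events
    ]
    return sorted(rows, key=lambda row: row[0])


# Hypothetical sample data, for illustration only.
policy_change = datetime(2026, 3, 16, 13, 42, tzinfo=timezone.utc)
auth_failure = datetime(2026, 3, 16, 13, 50, tzinfo=timezone.utc)
timeline = merged_timeline(
    [{"EventTime": policy_change, "EventName": "PutRolePolicy"}],
    [{"timestamp": int(auth_failure.timestamp() * 1000), "message": "AccessDenied in handler"}],
)
for when, source, summary in timeline:
    print(f"{when:%H:%M} [{source}] {summary}")
```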

Output to the on-call

Top 3 candidate causes, ranked. Each with the supporting CloudTrail events linked.

Recommended next step: "investigate the IAM change at 13:42; the request parameters of earlier events in CloudTrail history may show the previous version if you need to revert."

Always linked to the actual CloudTrail entries. The on-call can verify the agent's claim in one click.
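The on-call output can be sketched as a sort, a cut to three, and a link per supporting event. Everything specific here is an assumption: the score field stands in for whatever ranking the agent produces, and the link template is a placeholder to replace with your own console or log viewer URL.

```python
# Placeholder link format (hypothetical); substitute your own console URL.
LINK_TEMPLATE = "https://example.internal/cloudtrail/events/{event_id}"


def top_candidates(candidates, limit=3, link_template=LINK_TEMPLATE):
    """Return the highest-scoring candidate causes as ranked, linked lines."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)[:limit]
    lines = []
    for rank, candidate in enumerate(ranked, start=1):
        links = ", ".join(link_template.format(event_id=i) for i in candidate["event_ids"])
        lines.append(f"{rank}. {candidate['cause']} [{links}]")
    return lines


# Hypothetical sample data, for illustration only.
report = top_candidates([
    {"cause": "Security group rule removed at 13:20", "score": 0.4, "event_ids": ["evt-7"]},
    {"cause": "IAM policy change at 13:42", "score": 0.9, "event_ids": ["evt-1", "evt-2"]},
    {"cause": "Lambda configuration update at 13:55", "score": 0.6, "event_ids": ["evt-3"]},
    {"cause": "Tag edit at 13:10", "score": 0.1, "event_ids": ["evt-9"]},
])
print("\n".join(report))
```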