CloudTrail-Driven Triage: An Agent Pattern

Most cloud incidents have a CloudTrail event you missed. This pattern describes an agent that walks the trail, builds the causal chain, and writes the explanation.

Walk the trail

CloudTrail records management API calls in an AWS account by default (data events, such as S3 object-level operations and Lambda invocations, must be enabled separately), and for most cloud incidents the cause is in the trail; the question is finding it. The agent walks back from the affected resource: who called what, when, and what changed. Walk depth is typically 1 hour, because most causes appear in that window; deeper walks rarely add value but multiply tokens.
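The walk can be sketched as a single windowed lookup. This is a minimal sketch, not a definitive implementation: it assumes `client` is a boto3 CloudTrail client (for example `boto3.client("cloudtrail")`), and the function name `walk_trail` and the returned dict shape are hypothetical. Note that `lookup_events` accepts only one lookup attribute per call and covers management events from the last 90 days, so a real agent may need a different source for older or data events.

```python
from datetime import datetime, timedelta

WALK_DEPTH = timedelta(hours=1)  # typical walk depth from the text


def walk_trail(client, resource_name, symptom_time, depth=WALK_DEPTH):
    """Return who/what/when for events on a resource before the symptom, newest first.

    `client` is assumed to be a boto3 CloudTrail client; it is passed in so the
    function can be exercised with a stub.
    """
    paginator = client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "ResourceName", "AttributeValue": resource_name}
        ],
        StartTime=symptom_time - depth,
        EndTime=symptom_time,
    )
    events = []
    for page in pages:
        for event in page.get("Events", []):
            events.append(
                {
                    "id": event["EventId"],
                    "who": event.get("Username", "unknown"),
                    "what": event["EventName"],
                    "when": event["EventTime"],
                }
            )
    # Newest first: the agent walks backwards from the symptom.
    return sorted(events, key=lambda e: e["when"], reverse=True)
```

Keeping the depth as a parameter lets the agent widen the window only when the first pass finds nothing, which keeps token cost bounded in the common case.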

Building the causal chain

The causal chain is built bottom-up in layers. Start: the symptom (e.g., service was down at 14:00). Layer 1: events on the service’s resources in the 30 minutes before the symptom. Layer 2: events on dependencies of those resources in the 60 minutes before the symptom. Each layer narrows the candidate causes.
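The layering above could be expressed as two window filters over events that have already been fetched. This is an illustrative sketch: the event shape (dicts with `resource` and `when` keys) and the function name `build_layers` are assumptions, and how the agent discovers a resource's dependencies is left out.

```python
from datetime import datetime, timedelta

LAYER_1_WINDOW = timedelta(minutes=30)  # events on the service's own resources
LAYER_2_WINDOW = timedelta(minutes=60)  # events on dependencies of those resources


def build_layers(events, symptom_time, resources, dependencies):
    """Group events into the symptom plus two layers of candidate causes.

    events: iterable of dicts with "resource" and "when" keys (hypothetical shape).
    resources: names of the affected service's resources.
    dependencies: names of the resources those depend on.
    """
    def in_window(event, names, window):
        return (
            event["resource"] in names
            and symptom_time - window <= event["when"] < symptom_time
        )

    def newest_first(selected):
        return sorted(selected, key=lambda e: e["when"], reverse=True)

    events = list(events)
    return {
        "symptom": symptom_time,
        "layer_1": newest_first(
            e for e in events if in_window(e, set(resources), LAYER_1_WINDOW)
        ),
        "layer_2": newest_first(
            e for e in events if in_window(e, set(dependencies), LAYER_2_WINDOW)
        ),
    }
```

Events on unrelated resources, or outside either window, fall out of the chain entirely, which is what keeps the candidate list short.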

Writing the explanation

The explanation is two paragraphs. Paragraph 1: what changed. Paragraph 2: why that caused the symptom. Avoid speculation: the agent writes only facts that are in the trail, and if the trail does not show it, the agent does not claim it. Each claim carries its own confidence (“The IAM policy was modified at 13:42 (high confidence). This caused the failed authentications at 13:50 (medium confidence).”).
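One way to enforce the no-speculation rule is to make every claim carry the ID of the CloudTrail event that supports it and to drop any claim without one before rendering. The sketch below assumes this; the `Claim` type and function names are hypothetical, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str                       # statement without trailing punctuation
    confidence: str                 # "high", "medium", or "low"
    event_id: Optional[str] = None  # CloudTrail event that supports the claim


def render_paragraph(claims: List[Claim]) -> str:
    """Render claims as sentences, keeping only those backed by a trail event."""
    supported = [c for c in claims if c.event_id]
    return " ".join(f"{c.text} ({c.confidence} confidence)." for c in supported)


def write_explanation(what_changed: List[Claim], why_it_caused: List[Claim]) -> str:
    """Paragraph 1: what changed. Paragraph 2: why that caused the symptom."""
    paragraphs = [render_paragraph(what_changed), render_paragraph(why_it_caused)]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    print(
        write_explanation(
            [Claim("The IAM policy was modified at 13:42", "high", "evt-1")],
            [Claim("This caused the failed authentications at 13:50", "medium", "evt-2")],
        )
    )
```

Filtering at render time means an unsupported claim cannot reach the on-call even if an earlier step produced one.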

What the trail cannot show

Three blind spots deserve recognition: application bugs (code changes that did not touch AWS resources; CloudTrail does not see git), network problems (most network issues do not generate CloudTrail events and need a different signal source), and Lambda runtime errors (visible in CloudWatch Logs, not CloudTrail; the agent integrates both).
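Integrating both sources can be as simple as merging two time-sorted streams into one timeline and tagging each entry with where it came from. This is a sketch under assumptions: both inputs are already fetched, already sorted by a shared `when` key, and the dict shape and the name `merge_timelines` are hypothetical.

```python
import heapq


def merge_timelines(trail_events, log_events):
    """Merge two time-sorted event streams into one timeline.

    Each input is an iterable of dicts with a "when" key, sorted ascending.
    Every output entry gains a "source" key so the explanation can say
    which signal a fact came from.
    """
    tagged_trail = ({**e, "source": "cloudtrail"} for e in trail_events)
    tagged_logs = ({**e, "source": "cloudwatch_logs"} for e in log_events)
    # heapq.merge keeps the combined stream sorted without re-sorting everything.
    return list(heapq.merge(tagged_trail, tagged_logs, key=lambda e: e["when"]))
```

Keeping the source tag matters for the explanation step: a Lambda runtime error cited from logs should not be presented as a CloudTrail fact.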

Output to the on-call

The output is structured: the top 3 candidate causes, ranked, each with its supporting CloudTrail events linked, plus a recommended next step (“investigate the IAM change at 13:42; the previous version is in CloudTrail history if you need to revert”). Every claim links to the actual CloudTrail entries so the on-call can verify it in one click.
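A possible shape for that output is sketched below. The `Candidate` type, the `score` field used for ranking, and the report keys are all hypothetical. The console URL template is also an assumption about the deep-link format and should be checked against your own console before relying on it.

```python
from dataclasses import dataclass
from typing import List

# Assumption: deep-link format for a single event in the CloudTrail console.
EVENT_URL = (
    "https://{region}.console.aws.amazon.com/cloudtrailv2/home"
    "?region={region}#/events/{event_id}"
)


@dataclass
class Candidate:
    summary: str          # e.g. "IAM policy modified at 13:42"
    score: float          # hypothetical ranking score, higher is more likely
    event_ids: List[str]  # supporting CloudTrail event IDs
    next_step: str        # what the on-call should do about this candidate


def build_report(candidates: List[Candidate], region: str, top_n: int = 3) -> dict:
    """Rank candidates, keep the top N, and link every supporting event."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]
    return {
        "candidates": [
            {
                "rank": rank,
                "summary": c.summary,
                "events": [
                    EVENT_URL.format(region=region, event_id=eid)
                    for eid in c.event_ids
                ],
            }
            for rank, c in enumerate(ranked, start=1)
        ],
        # The recommended next step follows the top-ranked candidate.
        "recommended_next_step": ranked[0].next_step if ranked else None,
    }
```

The returned dict is plain data, so it can be passed to `json.dumps` or rendered into a chat message without further conversion.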