CloudTrail-Driven Triage: An Agent Pattern
Most cloud incidents have a CloudTrail event you missed. This pattern describes an agent that walks the trail, builds the causal chain, and writes the explanation.
Walk the trail
CloudTrail records API activity in an AWS account: management events by default, and data events (such as S3 object access or Lambda invocations) where they have been enabled. For most cloud incidents the cause is in the trail; the difficulty is finding it. The agent walks back from the affected resource: who called what, when, and what changed. Walk depth is typically 1 hour, because most causes appear in that window; deeper walks rarely add value and multiply token cost.
- CloudTrail records API activity. Management events by default, data events where enabled; the cause of most cloud incidents lives in the trail.
- Walk back from affected resource. Who called what, when, what changed.
- 1-hour walk depth. Most causes appear in that window; deeper rarely adds value.
- Per-walk token budget. Bounded by depth; supports cost control.
Building the causal chain
The causal chain is built bottom-up in layers, starting from the symptom (e.g., the service was down at 14:00). Layer 1 collects events on the service’s resources in the 30 minutes before the symptom. Layer 2 collects events on the dependencies of those resources in the 60 minutes before the symptom. Each layer narrows the candidate causes.
- Start: symptom. Service was down at 14:00; the anchor.
- Layer 1: 30 minutes back on resources. Direct events on the affected service’s resources.
- Layer 2: 60 minutes back on dependencies. Events on dependencies; widens the search.
- Bottom-up narrows candidates. Each layer reduces the candidate set.
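The layering can be sketched as a filter over events that have already been collected. This is an illustrative sketch under stated assumptions: each event is a dict with hypothetical `resource` and `when` keys, and `resources` and `dependencies` are sets of resource names supplied by the caller. How dependencies are discovered is not covered here.

```python
from datetime import timedelta

LAYER1_WINDOW = timedelta(minutes=30)  # direct resources
LAYER2_WINDOW = timedelta(minutes=60)  # dependencies of those resources


def build_layers(symptom_time, events, resources, dependencies):
    """Split events into the two layers, anchored at the symptom time."""

    def in_window(event, window):
        return symptom_time - window <= event["when"] <= symptom_time

    def newest_first(selected):
        return sorted(selected, key=lambda e: e["when"], reverse=True)

    layer1 = [e for e in events
              if e["resource"] in resources and in_window(e, LAYER1_WINDOW)]
    layer2 = [e for e in events
              if e["resource"] in dependencies and in_window(e, LAYER2_WINDOW)]
    return {
        "symptom_time": symptom_time,
        "layer1": newest_first(layer1),
        "layer2": newest_first(layer2),
    }
```

Events outside both windows, or on resources unrelated to the service, never enter the chain, so the candidate set stays small.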
Writing the explanation
The explanation is two paragraphs. Paragraph 1 states what changed. Paragraph 2 states why that change caused the symptom. The agent avoids speculation: it writes only facts that are in the trail, and if the trail does not show something, the agent does not claim it. Each claim carries a confidence level, for example: “The IAM policy was modified at 13:42 (high confidence). This caused the failed authentications at 13:50 (medium confidence).”
- Paragraph 1: what changed. The fact extracted from the trail.
- Paragraph 2: why it caused the symptom. The causal link from the change to the symptom.
- No speculation. Only facts in the trail; the agent doesn’t fabricate.
- Confidence per claim. High for facts; medium for inferred causality.
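One way to enforce these rules is to make every claim carry the trail events that support it and to derive the confidence label from the kind of claim. The sketch below is illustrative only: the `Claim` type, its fields, and `render_explanation` are hypothetical names, and the two-level scale mirrors the example above (high for facts read from the trail, medium for inferred causality).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    text: str                                           # sentence without the final period
    event_ids: List[str] = field(default_factory=list)  # supporting trail events
    inferred: bool = False                              # True for causal inferences


def render_explanation(what_changed, why_it_caused):
    """Render the two paragraphs, refusing claims the trail does not support."""

    def paragraph(claims):
        sentences = []
        for claim in claims:
            if not claim.event_ids:
                # No trail evidence means the agent does not make the claim.
                raise ValueError(f"unsupported claim: {claim.text!r}")
            level = "medium" if claim.inferred else "high"
            sentences.append(f"{claim.text} ({level} confidence).")
        return " ".join(sentences)

    return paragraph(what_changed) + "\n\n" + paragraph(why_it_caused)
```

Raising on an unsupported claim, rather than silently lowering its confidence, keeps speculation out of the output entirely.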
What the trail cannot show
Three blind spots deserve recognition. Application bugs: code changes that do not touch AWS resources are invisible, because CloudTrail does not see git. Network problems: most network issues do not generate CloudTrail events and need a different signal source. Lambda runtime errors: these are visible in CloudWatch Logs, not CloudTrail, so the agent integrates both.
- Application bugs. Code changes not touching AWS resources; CloudTrail can’t see git.
- Network problems. Most don’t generate CloudTrail events; need network signal source.
- Lambda runtime errors. Live in CloudWatch Logs, not CloudTrail; the agent integrates both.
- Per-blind-spot fallback. Each blind spot has a documented alternate signal source.
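The per-blind-spot fallback can be kept as a small lookup table. The sketch below is hypothetical: only CloudWatch Logs is named in the text as a source. "git history" is inferred from the remark that CloudTrail cannot see git, and "VPC Flow Logs" is an assumed example of a network signal source, since the text does not name one.

```python
# Documented alternate signal source per blind spot.
FALLBACK_SOURCES = {
    "application_bug": "git history",           # inferred from the text
    "network_problem": "VPC Flow Logs",         # assumption: not named in the text
    "lambda_runtime_error": "CloudWatch Logs",  # named in the text
}


def fallback_source(blind_spot):
    """Return the alternate source, or fail loudly if none is documented."""
    try:
        return FALLBACK_SOURCES[blind_spot]
    except KeyError:
        raise ValueError(
            f"no documented fallback for blind spot: {blind_spot!r}"
        ) from None
```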
Output to the on-call
The output is structured. It lists the top 3 candidate causes, ranked, each with links to the supporting CloudTrail events, followed by a recommended next step (“investigate the IAM change at 13:42; the previous version is in CloudTrail history if you need to revert”). Every claim links to the actual CloudTrail entries so the on-call can verify it in one click.
- Top 3 ranked candidates. Each with supporting CloudTrail events linked.
- Recommended next step. “Investigate IAM change at 13:42; previous version in history”.
- Linked to actual entries. One-click verification; the agent’s claim is checkable.
- Per-output replay artifact. The output captures provenance; supports postmortem.
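A structured output along these lines could look like the sketch below. The field names, the `score` used for ranking, and the `link_template` are all hypothetical; the placeholder URL is not a real console address and would need to be replaced with whatever link format the team uses to open a CloudTrail event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    summary: str
    score: float          # hypothetical ranking score, higher is more likely
    event_ids: List[str]  # CloudTrail event IDs that support the candidate


def build_report(candidates, next_step,
                 link_template="https://example.invalid/cloudtrail/events/{event_id}"):
    """Rank candidates, keep the top 3, and link every supporting event."""
    # A candidate with no supporting events cannot be verified, so drop it.
    supported = [c for c in candidates if c.event_ids]
    ranked = sorted(supported, key=lambda c: c.score, reverse=True)[:3]
    return {
        "candidates": [
            {
                "rank": rank,
                "summary": c.summary,
                "events": [
                    {"event_id": eid, "link": link_template.format(event_id=eid)}
                    for eid in c.event_ids
                ],
            }
            for rank, c in enumerate(ranked, start=1)
        ],
        "recommended_next_step": next_step,
    }
```

Because the report is a plain dictionary, it can be serialized as-is and stored as the replay artifact for the postmortem.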