AWS Cost Anomaly Triage: Agent Patterns

Cost Explorer flags an anomaly. The agent pulls the lineage, decides whether to ticket, and writes the cost summary to Slack.

Trigger from Cost Explorer

The agent triggers on AWS Cost Anomaly Detection events. When AWS flags an anomaly, the agent receives the service, account, date range, and dollar amount. Its first action is to pull the resource-level cost breakdown for that service over the date range and identify the resources contributing most to the anomaly, because usually one or two account for more than 80% of the spike.

Cost Anomaly Detection event. Service, account, date range, dollar amount; the trigger.
Resource-level breakdown first. The agent pulls the per-resource cost data.
1-2 resources dominate. >80% of spike; the focus.
Per-anomaly investigation start. The agent narrows from service to resource quickly.
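
The narrowing step can be sketched as follows. This is a minimal illustration, assuming the resource-level breakdown has already been fetched into a plain mapping of resource ID to cost; the function name `top_contributors` and the sample resource IDs are hypothetical, and the 0.8 default mirrors the 80% figure above.

```python
from typing import Dict, List, Tuple


def top_contributors(
    cost_by_resource: Dict[str, float], threshold: float = 0.8
) -> List[Tuple[str, float]]:
    """Return the smallest set of top resources covering `threshold` of the spike."""
    total = sum(cost_by_resource.values())
    if total <= 0:
        return []  # nothing to attribute

    # Rank resources from most to least expensive.
    ranked = sorted(cost_by_resource.items(), key=lambda kv: kv[1], reverse=True)

    picked: List[Tuple[str, float]] = []
    running = 0.0
    for resource_id, cost in ranked:
        picked.append((resource_id, cost))
        running += cost
        if running / total >= threshold:
            break  # enough of the spike is explained
    return picked


# Hypothetical breakdown: one bucket dominates the anomaly.
breakdown = {"bucket-a": 850.0, "bucket-b": 100.0, "bucket-c": 50.0}
focus = top_contributors(breakdown)
```

With the sample data, a single resource already covers 85% of the spike, so the investigation starts there.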

Cost lineage

Lineage routes the ticket to the right team. The agent walks back from the resource to the service (“this S3 bucket is owned by service X”), using tags, inventory, or naming conventions as the link. It then walks back from the service to the team (“service X is owned by team Y”). The lineage is the routing: the team that owns the service should see the ticket.

Resource to service. S3 bucket to owning service via tags, inventory, naming.
Service to team. Service catalog provides the team mapping.
Lineage = routing. The team that owns the service should see the ticket.
Per-resource ownership audit. Untagged resources are surfaced so that ownership gaps are visible.
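
A rough sketch of the two-hop walk, under stated assumptions: the tag key `service`, the name-prefix table, and the in-memory service catalog are all hypothetical stand-ins for whatever tagging scheme and catalog an organization actually uses.

```python
from typing import Dict, NamedTuple, Optional


class Lineage(NamedTuple):
    resource_id: str
    service: Optional[str]  # None when no link to a service was found
    team: Optional[str]     # None when the service has no catalog entry


def resolve_lineage(
    resource_id: str,
    tags: Dict[str, str],
    service_catalog: Dict[str, str],
    name_prefixes: Dict[str, str],
) -> Lineage:
    # Hop 1: resource -> service. Prefer an explicit tag (assumed key: "service").
    service = tags.get("service")
    if service is None:
        # Fall back to a naming convention: resource name prefix -> service.
        for prefix, candidate in name_prefixes.items():
            if resource_id.startswith(prefix):
                service = candidate
                break

    # Hop 2: service -> team, via the service catalog.
    team = service_catalog.get(service) if service is not None else None
    return Lineage(resource_id, service, team)


catalog = {"checkout": "payments-team"}
prefixes = {"chk-": "checkout"}
tagged = resolve_lineage("bucket-1", {"service": "checkout"}, catalog, prefixes)
orphan = resolve_lineage("mystery-bucket", {}, catalog, prefixes)
```

A result with `team` set to `None` has nowhere to route, which is the case the ownership audit is meant to surface.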

Ticket vs no-ticket

Anomalies fall into two cases, and the false-positive rate determines how well the agent separates them. Some anomalies are explained: a planned data ingestion, a scheduled batch, or a new feature launch. The agent matches the anomaly’s timing against known activities and, if it matches, files no ticket. Unexplained anomalies get a ticket that includes the anomaly summary, resource breakdown, team owner, and recommended next step. The false-positive rate matters because every ticket a team closes as “expected” makes the agent less trusted over time, so the matching against known activities needs ongoing tuning.

Explained anomalies: no ticket. Match against planned activities; if matched, skip.
Unexplained: ticket with full context. Summary, resource breakdown, team, next step.
False positive rate matters. “Expected” closures erode trust; tune matching.
Per-team match calibration. Each team’s known activities tuned; supports correct triage.
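
The ticket decision can be illustrated with a simple date-overlap match. This is a sketch only: the `Activity` and `Anomaly` shapes are hypothetical, and "matches the timing" is assumed here to mean that the anomaly's date range overlaps a known activity for the same service. A real matcher would likely need per-team tuning beyond this.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Activity:
    name: str
    service: str
    start: date
    end: date


@dataclass(frozen=True)
class Anomaly:
    service: str
    start: date
    end: date
    amount_usd: float


def matching_activity(anomaly: Anomaly, activities: List[Activity]) -> Optional[Activity]:
    """Return the first known activity that explains the anomaly, if any."""
    for activity in activities:
        same_service = activity.service == anomaly.service
        # Two inclusive date ranges overlap when each starts before the other ends.
        overlaps = activity.start <= anomaly.end and anomaly.start <= activity.end
        if same_service and overlaps:
            return activity
    return None


def should_ticket(anomaly: Anomaly, activities: List[Activity]) -> bool:
    # Explained anomalies are skipped; unexplained ones get a ticket.
    return matching_activity(anomaly, activities) is None


known = [Activity("planned data load", "s3", date(2024, 3, 4), date(2024, 3, 6))]
explained = Anomaly("s3", date(2024, 3, 5), date(2024, 3, 5), 420.0)
unexplained = Anomaly("s3", date(2024, 3, 20), date(2024, 3, 21), 420.0)
```

Keeping the match as a separate function makes it possible to put the matched activity's name in a log line, which helps when tuning the false-positive rate.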

Cost summary to Slack

A daily summary keeps the team aware. It is posted in the cost-monitoring channel and covers total spend, week-over-week change, anomalies detected, and tickets filed. The summary is a 3-line message, not a wall of text, because engineers scan; they do not read. A click-through leads to a fuller dashboard for those who want detail.

Daily summary in channel. Total spend, WoW, anomalies, tickets filed.
3-line message. Not a wall of text; engineers scan.
Click-through to dashboard. Detail one click away for those who want it.
Per-day rhythm. Same time each day; supports the habit.
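
One possible shape for the 3-line message, as a sketch: the function name, the wording of each line, and the dashboard URL are placeholders, and posting the text to Slack is left out.

```python
def format_daily_summary(
    total_usd: float,
    prev_week_usd: float,
    anomalies: int,
    tickets: int,
    dashboard_url: str,
) -> str:
    """Build a three-line summary: spend, anomaly counts, and a dashboard link."""
    if prev_week_usd > 0:
        change = (total_usd - prev_week_usd) / prev_week_usd * 100
        wow_text = f"{change:+.1f}% WoW"
    else:
        wow_text = "WoW n/a"  # no baseline to compare against

    lines = [
        f"Spend: ${total_usd:,.2f} ({wow_text})",
        f"Anomalies: {anomalies} detected, {tickets} ticketed",
        f"Details: {dashboard_url}",
    ]
    return "\n".join(lines)


# Placeholder URL for illustration only.
message = format_daily_summary(1100.0, 1000.0, 2, 1, "https://example.com/cost-dashboard")
```

Fixing the message at three lines in the formatter, rather than in a template that can grow, keeps the summary scannable as new metrics are requested.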

Learning from explanations

The agent learns from team feedback. When a team explains an anomaly (“this was a planned data load”), the agent records the explanation. Future anomalies that match the same pattern (same resource, similar timing) are auto-suppressed, so the team explains once and the agent remembers. Suppression has a TTL of 90 days because patterns shift and explanations expire.

Record team explanations. “Planned data load” saved as pattern.
Auto-suppress matching patterns. Same resource, similar timing; team explains once.
90-day suppression TTL. Patterns shift; explanations expire; re-ask.
Per-pattern learning compounds. Each recorded explanation reduces future noise.
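
A minimal sketch of suppression with a TTL. The in-memory store and its method names are hypothetical, and "similar timing" is approximated here as "same day of the week", which is an assumption made for illustration rather than something the pattern prescribes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional, Tuple

SUPPRESSION_TTL = timedelta(days=90)


@dataclass(frozen=True)
class Explanation:
    text: str
    recorded_on: date


class SuppressionStore:
    def __init__(self) -> None:
        self._by_pattern: Dict[Tuple[str, int], Explanation] = {}

    @staticmethod
    def _key(resource_id: str, anomaly_date: date) -> Tuple[str, int]:
        # Assumption: "similar timing" means the same weekday.
        return (resource_id, anomaly_date.weekday())

    def record(self, resource_id: str, anomaly_date: date, text: str, today: date) -> None:
        """Save a team's explanation as a suppression pattern."""
        self._by_pattern[self._key(resource_id, anomaly_date)] = Explanation(text, today)

    def lookup(self, resource_id: str, anomaly_date: date, today: date) -> Optional[Explanation]:
        """Return a live explanation for this pattern, or None if absent or expired."""
        key = self._key(resource_id, anomaly_date)
        explanation = self._by_pattern.get(key)
        if explanation is None:
            return None
        if today - explanation.recorded_on > SUPPRESSION_TTL:
            del self._by_pattern[key]  # expired: the team gets asked again
            return None
        return explanation


store = SuppressionStore()
store.record("bucket-a", date(2024, 1, 1), "planned data load", today=date(2024, 1, 1))
hit = store.lookup("bucket-a", date(2024, 1, 8), today=date(2024, 1, 8))
expired = store.lookup("bucket-a", date(2024, 4, 8), today=date(2024, 4, 8))
```

Passing `today` explicitly, instead of reading the clock inside the store, makes the expiry behavior easy to test.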