AWS Cost Anomaly Triage: Agent Patterns

Cost Explorer flags an anomaly. These patterns cover the agent that pulls the lineage, decides whether to ticket, and writes the cost summary to Slack.

Trigger from Cost Explorer

The agent triggers on AWS Cost Anomaly Detection events. AWS flags an anomaly and the agent receives the service, account, date range, and dollar amount. The agent’s first action is to pull the resource-level cost breakdown for that service in the date range, then identify the resources contributing most to the anomaly; usually one or two account for >80% of the spike.
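The "one or two resources account for most of the spike" step can be sketched as below. This is a minimal sketch, not the agent's actual implementation: it assumes the resource-level breakdown has already been fetched and reduced to a mapping of resource ID to cost increase over baseline. The function name, the input shape, and the 0.8 default are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def top_contributors(
    cost_delta_by_resource: Dict[str, float],
    threshold: float = 0.8,  # assumed cutoff, mirroring the ">80%" rule of thumb
) -> List[Tuple[str, float]]:
    """Return the smallest set of resources, largest first, whose combined
    cost increase reaches `threshold` of the total increase."""
    # Only resources whose cost went up can explain a spike.
    positive = {r: d for r, d in cost_delta_by_resource.items() if d > 0}
    total = sum(positive.values())
    if total <= 0:
        return []

    picked: List[Tuple[str, float]] = []
    running = 0.0
    for resource, delta in sorted(positive.items(), key=lambda kv: kv[1], reverse=True):
        picked.append((resource, delta))
        running += delta
        if running / total >= threshold:
            break
    return picked


# Hypothetical breakdown: dollars above baseline per resource.
breakdown = {"bucket-logs": 700.0, "bucket-raw": 200.0, "bucket-tmp": 60.0, "bucket-misc": 40.0}
print(top_contributors(breakdown))
```

Ignoring resources whose cost dropped keeps the shares meaningful; a net-of-decreases total would make a single large resource look like more than 100% of the spike.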

Cost lineage

Lineage routes the ticket to the right team. Walk back from the resource to the service (“this S3 bucket is owned by service X”), using tags, inventory, or naming conventions as the link. Then walk back from the service to the team (“service X is owned by team Y”). The lineage is the routing: the team that owns the service should see the ticket.
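The two-step walk can be sketched as a pair of lookups. This is only an illustration under stated assumptions: the tag key "service", the inventory and ownership mappings, and the name-prefix table are hypothetical stand-ins for whatever sources a given organization actually has.

```python
from typing import Dict, NamedTuple, Optional


class Lineage(NamedTuple):
    resource: str
    service: Optional[str]
    team: Optional[str]


def resolve_lineage(
    resource_id: str,
    tags: Dict[str, str],             # tags on the resource (assumed key: "service")
    inventory: Dict[str, str],        # resource ID -> service (hypothetical inventory)
    name_prefixes: Dict[str, str],    # naming convention: ID prefix -> service
    service_owners: Dict[str, str],   # service -> owning team
) -> Lineage:
    """Walk resource -> service -> team, trying tags, then inventory, then naming."""
    service = tags.get("service") or inventory.get(resource_id)
    if service is None:
        for prefix, candidate in name_prefixes.items():
            if resource_id.startswith(prefix):
                service = candidate
                break
    team = service_owners.get(service) if service else None
    return Lineage(resource_id, service, team)


owners = {"checkout": "payments-team"}
print(resolve_lineage("checkout-logs-prod", {}, {}, {"checkout-": "checkout"}, owners))
```

A `None` service or team is left visible rather than guessed, so unrouted anomalies can go to a fallback queue instead of the wrong team.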

Ticket vs no-ticket

Three considerations separate ticket-worthy anomalies from the rest. Explained anomalies get no ticket: a planned data ingestion, a scheduled batch, or a new feature launch. Match the anomaly’s timing against known activities; if it matches, no ticket. Unexplained anomalies get a ticket that includes the anomaly summary, resource breakdown, team owner, and recommended next step. The false positive rate matters because every ticket a team closes as “expected” makes the agent less trusted over time; tune the matching against known activities to keep it low.
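The timing match against known activities can be sketched as a date-range overlap check. The `KnownActivity` and `Anomaly` types and the rule "same service and overlapping dates" are assumptions made for illustration; a real matcher would likely need looser or stricter criteria, which is where the tuning happens.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class KnownActivity:  # hypothetical record of a planned load, batch, or launch
    name: str
    service: str
    start: date
    end: date


@dataclass
class Anomaly:  # hypothetical subset of the fields the agent receives
    service: str
    start: date
    end: date
    amount_usd: float


def matching_activity(anomaly: Anomaly, activities: List[KnownActivity]) -> Optional[KnownActivity]:
    """Return the first known activity on the same service whose dates overlap the anomaly."""
    for activity in activities:
        overlaps = activity.start <= anomaly.end and anomaly.start <= activity.end
        if activity.service == anomaly.service and overlaps:
            return activity
    return None


def should_ticket(anomaly: Anomaly, activities: List[KnownActivity]) -> bool:
    """Explained anomalies get no ticket; unexplained ones do."""
    return matching_activity(anomaly, activities) is None


planned = [KnownActivity("quarterly backfill", "s3", date(2024, 3, 1), date(2024, 3, 3))]
spike = Anomaly("s3", date(2024, 3, 2), date(2024, 3, 2), 1200.0)
print(should_ticket(spike, planned))  # False: the backfill explains it
```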

Cost summary to Slack

A daily summary keeps the team aware. It goes in the cost-monitoring channel and covers total spend, week-over-week change, anomalies detected, and tickets filed. The summary is a 3-line message, not a wall of text, because engineers scan; they do not read. It links through to a fuller dashboard for those who want detail.
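A 3-line message along those lines might be built as follows. This sketch only formats the text; posting it to Slack is left out. The function name, the wording of each line, and the example dashboard URL are assumptions, and "week-over-week" is taken here to mean a comparison against the same day one week earlier.

```python
def format_daily_summary(
    total_usd: float,
    prev_week_usd: float,   # assumed: spend on the same day one week earlier
    anomalies: int,
    tickets: int,
    dashboard_url: str,     # hypothetical link to the fuller dashboard
) -> str:
    """Build the three-line summary: spend, anomalies, click-through."""
    if prev_week_usd > 0:
        change = (total_usd - prev_week_usd) / prev_week_usd * 100
        wow = f"{change:+.1f}% WoW"
    else:
        wow = "WoW n/a"  # no baseline to compare against
    return "\n".join([
        f"Spend: ${total_usd:,.2f} ({wow})",
        f"Anomalies: {anomalies} detected, {tickets} ticketed",
        f"Details: {dashboard_url}",
    ])


print(format_daily_summary(1100.0, 1000.0, 2, 1, "https://example.com/cost-dashboard"))
```

Fixing the message at three lines in code, rather than by convention, keeps later additions from turning it back into a wall of text.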

Learning from explanations

The agent learns from team feedback. When a team explains an anomaly (“this was a planned data load”), the agent records the explanation. Future anomalies that match the same pattern (same resource, similar timing) are auto-suppressed, so the team explains once and the agent remembers. Suppression has a TTL of 90 days because patterns shift and explanations expire.
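Suppression with a TTL can be sketched as a small in-memory store. The class and method names are hypothetical, and "similar timing" is approximated here as "same day of the week", which is an assumption; a real agent would persist the entries and probably use a richer pattern key.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple

SUPPRESSION_TTL = timedelta(days=90)  # explanations expire after 90 days


class SuppressionStore:
    """Remembers explained anomaly patterns until their TTL runs out."""

    def __init__(self, ttl: timedelta = SUPPRESSION_TTL) -> None:
        self._ttl = ttl
        # (resource ID, weekday) -> (explanation, expiry time)
        self._entries: Dict[Tuple[str, int], Tuple[str, datetime]] = {}

    def record(self, resource_id: str, weekday: int, explanation: str, now: datetime) -> None:
        """Store a team's explanation; the TTL restarts from `now`."""
        self._entries[(resource_id, weekday)] = (explanation, now + self._ttl)

    def is_suppressed(self, resource_id: str, weekday: int, now: datetime) -> bool:
        """True if an unexpired explanation covers this resource and weekday."""
        key = (resource_id, weekday)
        entry = self._entries.get(key)
        if entry is None:
            return False
        _, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expired: the team must explain again
            return False
        return True


store = SuppressionStore()
explained_at = datetime(2024, 1, 1)
store.record("bucket-raw", explained_at.weekday(), "planned data load", explained_at)
print(store.is_suppressed("bucket-raw", explained_at.weekday(), explained_at + timedelta(days=28)))
```

Passing `now` explicitly rather than reading the clock inside the store makes the expiry behavior straightforward to test.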