Cost Anomaly Detection Configuration
AWS Cost Anomaly Detection finds unusual spend. Here is the configuration that catches real anomalies without the noise.
Setup
Cost anomaly detection is the discipline of catching cost spikes before they show up on the monthly bill. AWS Cost Anomaly Detection, Google Cloud Billing's cost anomaly detection, and similar tools provide the underlying detection; the configuration determines how useful that detection is. Sloppy configuration produces alert fatigue or missed anomalies; thoughtful configuration produces actionable signal.
What good setup looks like:
- One monitor per service or per team: Each monitor is scoped to a logical group: a service, a team, an account. The scope is narrow enough that an anomaly is not averaged away by aggregation. A single monitor for the entire organization buries small per-service anomalies.
- Granular enough to be actionable: The granularity matches the team that needs to act. Service-level monitors page service owners; team-level monitors page team leads. Granularity that does not align with ownership produces alerts nobody owns.
- Severity threshold of $100/day or 30% above baseline: The threshold determines what counts as an anomaly. Common defaults are $100/day or 30% above baseline; teams adjust based on service size and tolerance. Smaller services use smaller dollar thresholds; bigger services use larger ones.
- Tuneable: The thresholds are revisited periodically. Initial thresholds are guesses; observed alert volume informs adjustments. Too many alerts means the thresholds are too low; too few means they are too high.
- Tag-based scoping: Monitors can be scoped by tag (team:platform, service:api-gateway). Tag-based scoping aligns with the cost allocation strategy; if costs are tagged correctly, anomaly detection follows naturally.
The setup is the foundation. Without thoughtful setup, the detection produces noise instead of signal.
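To make the setup above concrete, here is a minimal sketch of a tag-scoped monitor and its alert threshold for AWS. It only builds request payloads and does not call AWS. The field names follow the Cost Explorer CreateAnomalyMonitor and CreateAnomalySubscription APIs as best understood here, so check them against the current API reference before use. The helper names (build_monitor, build_subscription), the naming scheme, the `team` tag key, and the ARNs are all hypothetical. Note also that the impact keys describe an anomaly's total impact, so the $100 figure only approximates the $100/day default mentioned above.

```python
# Sketch only: builds payloads shaped like AWS Cost Explorer anomaly requests.
# Nothing here calls AWS; names and the `team` tag key are assumptions.

def build_monitor(team: str) -> dict:
    """Custom monitor scoped to one value of a (hypothetical) `team` tag."""
    return {
        "MonitorName": f"{team}-cost-monitor",  # hypothetical naming scheme
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Tags": {"Key": "team", "Values": [team]},
        },
    }


def _impact_at_least(key: str, value: float) -> dict:
    """One threshold clause: anomaly impact >= value for the given key."""
    return {
        "Dimensions": {
            "Key": key,
            "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
            "Values": [f"{value:g}"],  # the API expects string values
        }
    }


def build_subscription(team: str, monitor_arn: str, sns_topic_arn: str,
                       min_dollars: float = 100.0,
                       min_percent: float = 30.0) -> dict:
    """Alert when impact is at least min_dollars OR at least min_percent."""
    return {
        "SubscriptionName": f"{team}-cost-alerts",  # hypothetical naming scheme
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "SNS", "Address": sns_topic_arn}],
        "Frequency": "IMMEDIATE",
        "ThresholdExpression": {
            "Or": [
                _impact_at_least("ANOMALY_TOTAL_IMPACT_ABSOLUTE", min_dollars),
                _impact_at_least("ANOMALY_TOTAL_IMPACT_PERCENTAGE", min_percent),
            ]
        },
    }
```

With boto3, these payloads would be passed as `AnomalyMonitor=...` to `create_anomaly_monitor` and `AnomalySubscription=...` to `create_anomaly_subscription` on the `ce` client. Keeping the thresholds as function arguments makes the periodic tuning a one-line change per team.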
Alerting
The alerting layer determines who learns about anomalies and how. Alerts that go to the right people quickly drive fast remediation; alerts that go to a generic mailbox or to central finance produce slow or no response.
- Alerts to the team that owns the service: The team that can fix the anomaly is the team that gets the alert. Service owner sees the spike; service owner investigates; service owner fixes. The accountability is direct.
- Not central finance: Central finance does not own the spending decisions and cannot fix the anomaly. Routing to finance produces escalation chains that delay action. The team that spent the money is the team that gets the alert.
- Daily summary in chat: A daily summary in the team chat consolidates the day's anomalies into one message rather than a stream of separate notifications. The cognitive load is bounded and the patterns are visible.
- One alert per anomaly across the org: Deduplication prevents repeated alerts for the same anomaly. The team is alerted once when the anomaly starts, not repeatedly as it persists. Persistent anomalies escalate via a different path.
- Channel routing by severity: Small anomalies go to a low-priority channel; large anomalies go to the team's main channel; very large anomalies page directly. The channel routing matches the urgency.
The alerting strategy is what turns detection into action. The strategy must respect the team's attention budget; aggressive alerting produces fatigue and the next anomaly is ignored.
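The deduplication and severity routing described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the Anomaly record, the dollar cutoffs, and the channel naming are all hypothetical, and a real system would persist the set of seen anomaly IDs rather than hold it in memory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Anomaly:
    anomaly_id: str
    team: str
    daily_impact_usd: float


# Hypothetical cutoffs; each team would tune these to its own spend.
PAGE_AT_USD = 5000.0
MAIN_CHANNEL_AT_USD = 500.0


def route(anomaly: Anomaly) -> str:
    """Pick a destination that matches the urgency of the anomaly."""
    if anomaly.daily_impact_usd >= PAGE_AT_USD:
        return f"page:{anomaly.team}"
    if anomaly.daily_impact_usd >= MAIN_CHANNEL_AT_USD:
        return f"#{anomaly.team}"
    return f"#{anomaly.team}-cost-low"


class Deduplicator:
    """Remembers anomaly IDs so each anomaly alerts only once."""

    def __init__(self) -> None:
        self._seen = set()

    def should_alert(self, anomaly: Anomaly) -> bool:
        if anomaly.anomaly_id in self._seen:
            return False
        self._seen.add(anomaly.anomaly_id)
        return True


def dispatch(anomalies: Iterable[Anomaly],
             dedup: Deduplicator) -> List[Tuple[str, str]]:
    """Return (destination, anomaly_id) pairs for anomalies not yet alerted."""
    alerts = []
    for anomaly in anomalies:
        if dedup.should_alert(anomaly):
            alerts.append((route(anomaly), anomaly.anomaly_id))
    return alerts
```

Routing on the owning team's name keeps alerts with the people who can act on them, and the deduplicator ensures a persisting anomaly does not re-alert on every detection run.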
Response
Detection and alerting without response is wasted work. The response is what produces the cost outcome. The discipline is investigating each alert and either explaining it (legitimate) or ticketing it (needs remediation).
- Every alert explained or ticketed: Every anomaly has an outcome. Either it is explained (a known cost change, a planned migration, a feature launch) or it is ticketed (an investigation that needs follow-up). No alert is left in limbo.
- Within 24 hours: The response time is bounded. Within 24 hours of the alert, the team has investigated and either explained or ticketed it. The fast response prevents accumulation; small anomalies do not become large ones.
- Patterns emerge: Over time, some anomalies prove to be recurring (monthly batch jobs, seasonal traffic, planned campaigns). The team learns which are normal and adjusts.
- Some anomalies become known and suppressed: Recurring anomalies that are intentional and well understood can be suppressed. The suppression is documented; the team knows why; future reviews can revisit it.
- Track the savings: When an investigation produces a fix that reduces cost, the savings are tracked. The cumulative savings demonstrate the program's value; leadership sees the ROI of the discipline.
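A response queue along these lines can be sketched as follows. The AlertRecord shape, the outcome labels, and the monthly savings field are hypothetical; the 24-hour window is the one described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

RESPONSE_WINDOW = timedelta(hours=24)


@dataclass
class AlertRecord:
    alert_id: str
    raised_at: datetime
    outcome: Optional[str] = None        # "explained", "ticketed", or None
    monthly_savings_usd: float = 0.0     # filled in once a fix lands


def overdue(alerts: Iterable[AlertRecord], now: datetime) -> List[str]:
    """IDs of alerts with no outcome after the response window has passed."""
    return [
        a.alert_id
        for a in alerts
        if a.outcome is None and now - a.raised_at > RESPONSE_WINDOW
    ]


def total_savings(alerts: Iterable[AlertRecord]) -> float:
    """Cumulative monthly savings attributed to resolved alerts."""
    return sum(a.monthly_savings_usd for a in alerts)
```

Reviewing the overdue list daily keeps alerts from sitting in limbo, and the savings total gives leadership a single number to track.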
Cost anomaly detection configuration is one of those FinOps disciplines that pays off in proportion to the investment. Nova AI Ops integrates with cost data from AWS, GCP, and Azure, surfaces anomalies attributed to specific services and teams, and produces the response queue that turns detection into savings.