Anomaly Detection vs Static Thresholds
Static thresholds and anomaly detection are two approaches to alerting. Which one fits depends on the workload pattern.
Where static thresholds win
Static thresholds win on contractual numbers and stable workloads. SLA values like 99.9% availability or 200ms p99 latency are contractual; the threshold is the contract. On stable workloads where traffic is predictable within 20%, static thresholds catch real outliers cheaply, and the on-call engineer understands the trigger without reading ML output.
- SLA values. 99.9% availability, 200ms p99 latency, 5% error rate; contractual numbers, the threshold is the contract.
- Stable workloads. If traffic is predictable within 20%, a static threshold catches real outliers.
- Cheap to define and debug. The on-call engineer understands the trigger without reading ML output.
- Per-threshold ownership. Each static threshold has an owner who can defend the value, which gives an investigation someone to ask.
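A static threshold is small enough to express as a limit plus an owner. The sketch below is a minimal illustration in Python, not any particular tool's API: the `StaticThreshold` class, the metric names, and the owner names are hypothetical, while the 200ms and 99.9% limits come from the SLA examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticThreshold:
    """One fixed limit with a named owner (all names here are hypothetical)."""
    metric: str
    limit: float
    owner: str                  # the person or team who can defend the value
    above_is_bad: bool = True   # False for metrics like availability

    def breached(self, value: float) -> bool:
        # The whole trigger is one comparison, which is why it is cheap to debug.
        if self.above_is_bad:
            return value > self.limit
        return value < self.limit


P99_LATENCY = StaticThreshold("p99_latency_ms", 200.0, owner="team-checkout")
AVAILABILITY = StaticThreshold(
    "availability_pct", 99.9, owner="team-sre", above_is_bad=False
)
```

Because the trigger is a single comparison against a single number, the answer to "why did this fire?" is the observed value and the limit, with no model output to inspect.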
Where anomaly detection wins
Anomaly detection wins on seasonal traffic, per-tenant variation, and cardinality-heavy metrics. E-commerce during holidays, payroll on the 15th, and weekday-vs-weekend patterns all have a shape that a single static threshold cannot capture; per-tenant variation needs per-series baselines, which anomaly detection produces automatically.
- Seasonal traffic. E-commerce during holidays, payroll on the 15th, weekday vs weekend patterns.
- Per-tenant variation. A static threshold for the global metric misses tenant-specific outages.
- Cardinality-heavy metrics. Per-series static thresholds are impractical; anomaly detection produces baselines automatically.
- Per-region baselines. Region-specific traffic shapes are captured automatically, so global and regional views can be alerted on side by side.
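To make the per-series baseline idea concrete, here is a minimal sketch that scores each tenant against its own history using a median and median absolute deviation (MAD) robust z-score. This is one common technique chosen for illustration, not the method used by any specific product; the function names, the tenant data, and the default sensitivity of 3.5 are assumptions.

```python
from statistics import median


def mad_score(history, value):
    """Robust z-score of `value` against `history` (median / MAD based)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: any deviation at all is infinitely unusual.
        return 0.0 if value == med else float("inf")
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return 0.6745 * (value - med) / mad


def is_anomalous(history, value, sensitivity=3.5):
    # `sensitivity` is an assumed default; real deployments tune it per metric.
    return abs(mad_score(history, value)) > sensitivity


# Hypothetical requests-per-minute history for two tenants of very different size.
histories = {
    "tenant_a": [100, 102, 98, 101, 99, 103, 97],
    "tenant_b": [10, 11, 9, 10, 12, 10, 11],
}

# tenant_b drops to zero while tenant_a is normal: the global total only
# moves from about 110 to about 101, but tenant_b's own baseline flags it.
latest = {"tenant_a": 101, "tenant_b": 0}
flagged = [t for t, v in latest.items() if is_anomalous(histories[t], v)]
```

In this example the outage is invisible in the global sum but obvious against the tenant's own baseline, which is the case the per-tenant bullet describes.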
The trade-off
Anomaly detection comes with costs. The default false-positive rate is higher unless sensitivity is tuned carefully; alerts are harder to debug because “why did this fire?” needs model output, not just a number; and there is tooling lock-in because Datadog Watchdog, Prometheus MAD, and GCP MQL are not interchangeable.
- Higher false-positive rate. Tune sensitivity carefully or you trade noisy static alerts for noisy ML alerts.
- Debugging difficulty. “Why did this fire?” needs the model output, not just a number.
- Tooling lock-in. Datadog Watchdog, Prometheus MAD, and GCP MQL are not interchangeable; switching tools means rewriting alerts.
- Per-tool calibration cost. Each tool needs its own sensitivity tuning; the team carries the calibration cost.
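The sensitivity trade-off can be seen with a small sweep. The sketch below runs a plain rolling z-score detector over synthetic traffic that contains no incident, so every fire is a false positive by construction. The detector, the window size, the sensitivity values, and the data are all assumptions made for illustration, not the behaviour of any tool named above.

```python
import random
from statistics import mean, pstdev


def count_fires(series, window, sensitivity):
    """Count points whose z-score against the preceding window exceeds sensitivity."""
    fires = 0
    for i in range(window, len(series)):
        base = series[i - window:i]
        sd = pstdev(base)
        if sd > 0 and abs(series[i] - mean(base)) / sd > sensitivity:
            fires += 1
    return fires


rng = random.Random(0)  # fixed seed so the sweep is repeatable
# Synthetic, incident-free traffic: noise around 100 with standard deviation 5.
clean = [rng.gauss(100.0, 5.0) for _ in range(500)]

# Lower sensitivity value = more trigger-happy detector.
sweep = {s: count_fires(clean, window=30, sensitivity=s) for s in (2.0, 3.0, 4.0)}
```

The count can only fall as the sensitivity value rises, because any point that exceeds a higher cutoff also exceeds a lower one. The calibration cost is finding the cutoff where the remaining fires are ones the on-call engineer actually wants.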
Hybrid is usually right
Hybrid alerting is usually the right answer. Use static thresholds on contractual SLAs and known dangerous values (disk > 90%, queue > 10k), and anomaly detection on traffic-shape metrics where the normal range varies hourly or seasonally. The static alerts that already work do not need replacing.
- Static for contracts and dangers. SLAs and known dangerous values (disk > 90%, queue > 10k); the threshold is unambiguous.
- Anomaly for traffic-shape. Metrics where the normal range varies hourly or seasonally; the baseline must be learned.
- Don’t replace what works. The static alerts that work are not the problem; preserve them.
- Per-metric mode documented. Each metric’s alert mode (static or anomaly) is committed to the alert config, so operators can see which kind of alert they are looking at.
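One way to record the per-metric mode is a single config that routes each metric to the right check. The following is a hedged sketch, not a real alerting tool's schema: `ALERT_CONFIG`, `should_alert`, the metric names, and the sensitivity value are hypothetical, while the disk and queue limits are the ones quoted above. The anomaly branch uses a median/MAD score as a stand-in for whatever detector the chosen tool provides.

```python
from statistics import median

# Hypothetical alert config: every metric declares its mode explicitly.
ALERT_CONFIG = {
    "disk_used_pct": {"mode": "static", "limit": 90.0},
    "queue_depth": {"mode": "static", "limit": 10_000},
    "requests_per_min": {"mode": "anomaly", "sensitivity": 3.5},
}


def should_alert(metric, value, history=()):
    """Dispatch on the documented mode; anomaly mode needs a non-empty history."""
    rule = ALERT_CONFIG[metric]
    if rule["mode"] == "static":
        return value > rule["limit"]
    # Anomaly mode: robust z-score against this metric's own recent history.
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return value != med
    return abs(0.6745 * (value - med) / mad) > rule["sensitivity"]
```

A call such as `should_alert("disk_used_pct", 93.0)` needs no history at all, while `should_alert("requests_per_min", 400, recent_values)` is only meaningful relative to the learned baseline, which is the distinction the hybrid approach rests on.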
How to pick per metric
The pick is metric-specific. A metric with a known dangerous value (SLA, capacity limit) gets a static threshold; a metric with strong seasonality gets anomaly detection with a seasonality model; a metric with per-tenant variation gets anomaly detection with per-tenant baselines. Match the technique to the metric’s shape.
- Known dangerous value. SLA or capacity limit; static threshold.
- Strong seasonality. Anomaly detection with seasonality model; the baseline tracks the cycle.
- Per-tenant variation. Per-tenant baselines via anomaly detection; supports tenant-specific outage detection.
- Per-metric decision recorded. The technique choice committed to the metric’s alert config; supports later review.
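The selection rules above can be written down as a small function so the choice is reviewable. This is a minimal sketch under assumptions: the `MetricProfile` fields, the returned labels, the order in which the rules are checked, and the fallback to a static threshold for metrics with no special shape are illustrative choices, not something the rules themselves fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricProfile:
    """Hypothetical description of a metric's shape."""
    name: str
    has_known_limit: bool = False   # SLA or capacity limit exists
    seasonal: bool = False          # normal range follows a cycle
    per_tenant: bool = False        # behaviour differs per tenant


def pick_technique(profile):
    # Assumed precedence: a known dangerous value always gets a static threshold.
    if profile.has_known_limit:
        return "static"
    if profile.per_tenant:
        return "anomaly:per-tenant-baseline"
    if profile.seasonal:
        return "anomaly:seasonal-model"
    # Assumed fallback: stable, unremarkable metrics stay on static thresholds.
    return "static"


choices = {
    p.name: pick_technique(p)
    for p in (
        MetricProfile("disk_used_pct", has_known_limit=True),
        MetricProfile("checkout_requests", seasonal=True),
        MetricProfile("tenant_error_rate", per_tenant=True),
    )
}
```

Committing the resulting `choices` mapping next to the alert config is one way to keep the per-metric decision on record.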