Alerts Use Historical Baseline
Compare current values to historical values to detect anomalies.
Static vs baseline thresholds
Static thresholds (“alert if CPU > 80%”) work for capacity but fail for traffic, latency, and error rates that vary by hour or day. Baseline thresholds compare current values to historical values for the same window (“alert if request rate is 3x the median for this minute over the last 14 days”). Default to baseline for seasonal signals and static for capacity ceilings.
- Static for capacity. “CPU > 80%”; works when the threshold is a hard physical limit.
- Static fails for seasonal. Traffic, latency, error rates vary by hour or day; static produces noise.
- Baseline compares to history. “3x median for this minute over last 14 days”; the seasonal-aware comparison.
- Per-signal default. Baseline for seasonal signals, static for capacity ceilings; the routing rule.
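The two rule types above can be sketched side by side. This is a minimal illustration, not any vendor's implementation: the function names, the 3x multiplier default, and the choice to stay silent when there is no history are all assumptions made for the example.

```python
from statistics import median
from typing import List


def breaches_static(value: float, ceiling: float) -> bool:
    """Static rule: fire when the value exceeds a fixed ceiling (e.g. CPU > 80)."""
    return value > ceiling


def breaches_baseline(current: float, history: List[float],
                      multiplier: float = 3.0) -> bool:
    """Baseline rule: fire when the current value exceeds `multiplier` times
    the median of historical values for the same window."""
    if not history:
        # Assumption: with no history there is no baseline, so do not fire.
        return False
    return current > multiplier * median(history)


# Request rate for the same minute on each of the last 14 days (made-up numbers).
same_minute_history = [100, 110, 95, 105, 98, 102, 101, 99, 104, 97, 103, 100, 96, 108]

cpu_alert = breaches_static(85.0, 80.0)                    # fixed capacity ceiling
traffic_alert = breaches_baseline(320.0, same_minute_history)  # seasonal signal
```

The median is used rather than the mean so that one earlier incident in the history does not inflate the baseline.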
How to baseline
Most monitoring tools support baselining natively: Datadog forecast and outlier monitors, Prometheus PromQL with predict_linear and quantile windows, and Nova AI Ops anomaly detection. Use a 14-28 day window, because a shorter window misses weekly seasonality and a longer one is slow to react. Compare against the same time of day and the same day of week.
- Vendor support. Datadog forecast/outlier, Prometheus predict_linear, Nova AI Ops anomaly detection.
- 14-28 day window. Shorter misses weekly seasonality; longer is slow to react to real traffic shifts.
- Same time-of-day, same day-of-week. “Last 4 Tuesdays at 14:00-15:00” beats “last 14 days flat”.
- Per-baseline tuning. Window and sensitivity tuned per signal; supports correct firing.
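The same-time-of-day, same-day-of-week comparison can be sketched as follows. This is an illustrative sketch only: it assumes samples are already aggregated into hourly buckets keyed by naive datetimes, it ignores daylight-saving shifts, and the `seasonal_baseline` name and 4-week default are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, Optional


def seasonal_baseline(hourly: Dict[datetime, float], now: datetime,
                      weeks: int = 4) -> Optional[float]:
    """Median of the values seen in the same hour on the same weekday
    over the previous `weeks` weeks, or None if no such samples exist."""
    bucket = now.replace(minute=0, second=0, microsecond=0)
    values = []
    for k in range(1, weeks + 1):
        ts = bucket - timedelta(weeks=k)  # same weekday, same hour, k weeks back
        if ts in hourly:
            values.append(hourly[ts])
    return median(values) if values else None


# Hourly request rates (made-up): four previous Tuesdays at 14:00, plus one
# unrelated sample that must not influence the baseline.
hourly_rates = {
    datetime(2024, 5, 21, 14): 100.0,
    datetime(2024, 5, 14, 14): 120.0,
    datetime(2024, 5, 7, 14): 110.0,
    datetime(2024, 4, 30, 14): 130.0,
    datetime(2024, 5, 26, 3): 5.0,   # Sunday 03:00, ignored
}

baseline = seasonal_baseline(hourly_rates, datetime(2024, 5, 28, 14, 30))
```

A flat 14-day median would mix the quiet Sunday-night sample into the Tuesday-afternoon baseline; stepping back in whole weeks avoids that.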
When baselining fails
Baselining has predictable failure modes. First-time-of-year events (Black Friday, tax season, World Cup) leave the baseline with no comparable data, so the alert fires constantly; handle known events with hard-coded exclusions via Datadog or Nova AI Ops seasonality overrides. After a major change (marketing campaign, viral mention, product launch), reset the baseline manually rather than trusting auto-fit.
- First-time-of-year events. Black Friday, tax season, World Cup; baseline has no data, alert fires constantly.
- Seasonality overrides. Hard-coded exclusions in Datadog and Nova AI Ops; supports known event windows.
- Reset after major change. Marketing campaign, viral mention, product launch; don’t trust auto-fit.
- Per-event documented exclusion. Each excluded window committed to the rule config; supports auditability.
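A documented exclusion window can be as simple as a named time range checked before the baseline rule is evaluated. The sketch below is tool-agnostic and hypothetical: `ExclusionWindow` and `is_excluded` are illustrative names, not the Datadog or Nova AI Ops override API, and the date is only an example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class ExclusionWindow:
    """One known event during which the baselined alert is suppressed."""
    name: str
    start: datetime   # inclusive
    end: datetime     # exclusive
    reason: str       # kept in the rule config for auditability


def is_excluded(ts: datetime, windows: List[ExclusionWindow]) -> bool:
    """True when `ts` falls inside any documented exclusion window."""
    return any(w.start <= ts < w.end for w in windows)


# Example window (hypothetical values), committed alongside the alert rule.
EXCLUSIONS = [
    ExclusionWindow(
        name="black-friday-2024",
        start=datetime(2024, 11, 29),
        end=datetime(2024, 11, 30),
        reason="first-time-of-year event; baseline has no comparable data",
    ),
]

suppressed = is_excluded(datetime(2024, 11, 29, 12, 0), EXCLUSIONS)
```

Keeping the reason on each window means a reviewer can later tell why an alert was silent during an incident.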
The tuning loop
Tuning is continuous. Track the false-positive rate per baselined alert; a rate above 10% means the window or sensitivity is wrong. Track the false-negative rate by running synthetic chaos tests that inject a real anomaly and confirm the alert fires within 2 minutes. Review monthly, because baselines drift and tuning is permanent overhead.
- False-positive rate target. Above 10% means window or sensitivity is wrong.
- False-negative chaos. Inject a real anomaly; confirm alert fires within 2 minutes.
- Monthly review. Baselines drift; tuning is permanent overhead, not one-time work.
- Per-month tuning record. Documented changes per cycle; supports continued accuracy.
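The two checks in the tuning loop reduce to small calculations. The sketch below assumes each fired alert has been triaged by hand as a true or false positive, and that the chaos tooling records when the anomaly was injected and when the alert fired; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def false_positive_rate(fired: int, false_positives: int) -> float:
    """Share of fired alerts that turned out not to be real problems."""
    if fired == 0:
        return 0.0
    return false_positives / fired


def needs_retuning(fired: int, false_positives: int,
                   max_rate: float = 0.10) -> bool:
    """Flag the alert when its false-positive rate exceeds the 10% target."""
    return false_positive_rate(fired, false_positives) > max_rate


def detected_in_time(injected_at: datetime, fired_at: Optional[datetime],
                     limit: timedelta = timedelta(minutes=2)) -> bool:
    """Chaos check: the alert must fire within `limit` of the injected anomaly.
    `fired_at` is None when the alert never fired (a false negative)."""
    if fired_at is None:
        return False
    return timedelta(0) <= fired_at - injected_at <= limit


# Monthly review example (made-up numbers): 40 firings, 5 judged false.
retune = needs_retuning(fired=40, false_positives=5)
caught = detected_in_time(datetime(2024, 6, 1, 10, 0, 0),
                          datetime(2024, 6, 1, 10, 1, 30))
```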
Pick by signal type
The pick is signal-driven. Traffic, request rate, error count: baseline. Disk, memory, connection pool, queue depth: static. Latency: hybrid (baseline for slow drift, static for SLO breaches). The signal’s shape drives the choice.
- Baseline signals. Traffic, request rate, error count; the seasonal patterns.
- Static signals. Disk, memory, connection pool, queue depth; the capacity patterns.
- Hybrid for latency. Baseline for slow drift, static for SLO breaches; both layers needed.
- Per-signal documented choice. The technique committed to the rule config; supports later review.
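The routing rule above can be written down as a lookup table, with the hybrid latency case combining both layers. This is a sketch under assumptions: the signal names, the `drift_factor` of 1.5, and the function names are illustrative and would be tuned per signal.

```python
from enum import Enum
from typing import Optional


class Technique(Enum):
    BASELINE = "baseline"
    STATIC = "static"
    HYBRID = "hybrid"


# Documented choice per signal, mirroring the list above.
TECHNIQUE_BY_SIGNAL = {
    "traffic": Technique.BASELINE,
    "request_rate": Technique.BASELINE,
    "error_count": Technique.BASELINE,
    "disk": Technique.STATIC,
    "memory": Technique.STATIC,
    "connection_pool": Technique.STATIC,
    "queue_depth": Technique.STATIC,
    "latency": Technique.HYBRID,
}


def pick_technique(signal: str) -> Technique:
    """Look up the documented technique; unknown signals are an error so that
    every new alert forces an explicit choice."""
    try:
        return TECHNIQUE_BY_SIGNAL[signal]
    except KeyError:
        raise ValueError(f"no documented technique for signal {signal!r}") from None


def latency_breach(current_ms: float, baseline_ms: Optional[float],
                   slo_ms: float, drift_factor: float = 1.5) -> bool:
    """Hybrid rule: fire on a static SLO breach, or on drift above the
    baseline (assumed here to mean more than drift_factor x baseline)."""
    slo_breach = current_ms > slo_ms
    drift = baseline_ms is not None and current_ms > drift_factor * baseline_ms
    return slo_breach or drift
```

For example, `latency_breach(320, 200, 500)` fires on drift even though the 500 ms SLO is still met, while `latency_breach(520, 200, 500)` fires on the SLO layer.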