Alerts Use Historical Baseline

Compare current values against historical values for the same period to detect anomalies.

Static vs baseline thresholds

Static thresholds (“alert if CPU > 80%”) work for capacity but fail for traffic, latency, and error rates, which vary by hour and by day. Baseline thresholds compare the current value to historical values for the same window (“alert if request rate is 3x the median for this minute over the last 14 days”). Default to baseline thresholds for seasonal signals and static thresholds for capacity ceilings.
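The two threshold styles can be contrasted in a few lines. This is a minimal sketch, not any vendor's implementation: the function names are hypothetical, and it assumes the caller has already collected the values observed at the same minute on each of the previous 14 days.

```python
from statistics import median


def exceeds_static(current, ceiling=80.0):
    """Static threshold: fire when the value crosses a fixed ceiling."""
    return current > ceiling


def exceeds_baseline(current, same_minute_history, multiplier=3.0):
    """Baseline threshold: fire when the value is more than `multiplier`
    times the median of the values seen at this minute on previous days.

    `same_minute_history` is assumed to hold one value per historical day.
    """
    if not same_minute_history:
        # No baseline yet. Staying silent is an assumption of this sketch;
        # a real system would need an explicit fallback.
        return False
    return current > multiplier * median(same_minute_history)
```

With 14 days of history at 100 requests per minute, a reading of 350 trips the baseline check and a reading of 250 does not, regardless of what a fixed ceiling would say.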

How to baseline

Most monitoring tools support baselining natively: Datadog forecast and outlier monitors, Prometheus PromQL with predict_linear and quantile windows, and Nova AI Ops anomaly detection. Use a 14-28 day window: a shorter window misses weekly seasonality, and a longer one is slow to react. Compare against the same time of day and the same day of week.
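For teams computing a baseline by hand rather than through one of those tools, the "same time of day, same day of week" rule could look roughly like this. It is a sketch under assumptions: `seasonal_baseline` is a hypothetical helper, samples are assumed to be (timestamp, value) pairs at minute resolution, and the median is used as the summary statistic.

```python
from datetime import datetime, timedelta
from statistics import median


def seasonal_baseline(samples, now, window_days=28):
    """Median of past values that share `now`'s weekday and hour:minute.

    `samples` is an iterable of (datetime, value) pairs. Only samples
    within the last `window_days` days are considered.
    Returns None when no comparable sample exists.
    """
    cutoff = now - timedelta(days=window_days)
    matching = [
        value
        for ts, value in samples
        if cutoff <= ts < now
        and ts.weekday() == now.weekday()
        and (ts.hour, ts.minute) == (now.hour, now.minute)
    ]
    return median(matching) if matching else None
```

Note that matching on day of week leaves only one comparable sample per week, so a 28-day window yields at most four points and a 14-day window only two.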

When baselining fails

Baselining has predictable failure modes. For first-time-of-year events (Black Friday, tax season, World Cup), the baseline has no comparable data, so the alert fires constantly. Handle known events with hard-coded exclusions, for example via Datadog or Nova AI Ops seasonality overrides. After a major change (marketing campaign, viral mention, product launch), reset the baseline manually rather than trusting auto-fit.
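Both mitigations, excluding known events and resetting after a major change, amount to filtering the history that feeds the baseline. The sketch below illustrates that idea only; it does not reproduce the Datadog or Nova AI Ops override features, and the event calendar, dates, and function names are hypothetical.

```python
from datetime import datetime

# Hypothetical calendar of known events as (start, end) pairs.
# The dates are illustrative only.
KNOWN_EVENTS = [
    (datetime(2024, 11, 29), datetime(2024, 11, 30)),
]


def is_excluded(ts, events=KNOWN_EVENTS):
    """True if `ts` falls inside any known-event window (end exclusive)."""
    return any(start <= ts < end for start, end in events)


def usable_history(samples, reset_at=None, events=KNOWN_EVENTS):
    """Filter (datetime, value) samples before computing a baseline.

    Drops samples inside known-event windows, and, when `reset_at` is
    given, drops everything before that manual reset point.
    """
    return [
        (ts, value)
        for ts, value in samples
        if not is_excluded(ts, events)
        and (reset_at is None or ts >= reset_at)
    ]
```

The same `is_excluded` check could also be used to mute alert evaluation while a known event is in progress.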

The tuning loop

Tuning is continuous. Track the false-positive rate per baselined alert; a rate above 10% means the window or sensitivity is wrong. Track the false-negative rate by running synthetic chaos tests that inject a real anomaly and confirm the alert fires within 2 minutes. Review monthly: baselines drift, so tuning is permanent overhead.
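The two checks in the tuning loop reduce to simple bookkeeping. A minimal sketch follows, assuming each alert firing has been labeled actionable or not during review, and that chaos tests record when the anomaly was injected and when the alert fired; all names here are hypothetical.

```python
from datetime import datetime, timedelta


def false_positive_rate(firings):
    """Fraction of firings that were not actionable.

    `firings` is a list of booleans: True means the firing was a real
    problem, False means it was noise.
    """
    if not firings:
        return 0.0
    return firings.count(False) / len(firings)


def needs_retuning(firings, max_fp_rate=0.10):
    """True when the false-positive rate exceeds the 10% guideline."""
    return false_positive_rate(firings) > max_fp_rate


def detected_in_time(injected_at, fired_at, deadline=timedelta(minutes=2)):
    """True if the alert fired within `deadline` of the injected anomaly.

    `fired_at` is None when the alert never fired (a false negative).
    """
    if fired_at is None:
        return False
    return timedelta(0) <= fired_at - injected_at <= deadline
```

Running `needs_retuning` per alert during the monthly review gives a short list of alerts whose window or sensitivity deserves a second look.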

Pick by signal type

The choice follows the signal’s shape. Traffic, request rate, and error count: baseline. Disk, memory, connection pool, and queue depth: static. Latency: hybrid (baseline for slow drift, static for SLO breaches).
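The mapping above can be written down as a lookup table, with the hybrid latency rule as a pair of checks. This is an illustrative sketch: the signal keys, the function name, and the 1.5x drift multiplier are assumptions, not values from any tool.

```python
# Threshold strategy per signal type, following the list above.
STRATEGY = {
    "traffic": "baseline",
    "request_rate": "baseline",
    "error_count": "baseline",
    "disk": "static",
    "memory": "static",
    "connection_pool": "static",
    "queue_depth": "static",
    "latency": "hybrid",
}


def latency_alert(current_ms, baseline_ms, slo_ms, drift_multiplier=1.5):
    """Hybrid check for latency.

    Fires on an SLO breach (static ceiling) or when latency drifts well
    above its historical baseline. The multiplier is an assumed default.
    """
    breaches_slo = current_ms > slo_ms
    drifts = current_ms > drift_multiplier * baseline_ms
    return breaches_slo or drifts
```

With a 100 ms baseline and a 500 ms SLO, a reading of 200 ms fires on drift alone, while 120 ms stays quiet.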