Alerts Practical · By Samson Tanimawo, PhD · Published Oct 24, 2025 · 4 min read

Alerts Use Historical Baseline

Compare current values to a historical baseline for the same window to detect anomalies.

Static vs baseline thresholds

Static thresholds ("alert if CPU > 80%") work for capacity limits. They fail for traffic, latency, and error rates that vary by hour or day.

Baseline thresholds compare current values to historical values for the same window: "Alert if the request rate is 3x the median for this minute over the last 14 days."

Default to baseline thresholds for any signal with daily or weekly seasonality. Default to static for capacity ceilings.
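As a minimal sketch, the baseline rule above might look like this; the 3x factor and 14-day history follow the example, and the function name is illustrative:

```python
from statistics import median

def baseline_breach(current, history, factor=3.0):
    """Return True if `current` exceeds `factor` times the
    historical median for the same minute-of-day.

    `history` holds one observation per day for this minute,
    e.g. requests/min at 14:07 over the last 14 days.
    """
    if not history:
        return False  # no baseline yet: don't alert on a cold start
    return current > factor * median(history)

# Same minute over the last 14 days; today's rate is a spike.
last_14_days = [980, 1010, 995, 1024, 990, 1005, 1000,
                970, 1015, 1002, 988, 1030, 996, 1008]
print(baseline_breach(3200, last_14_days))  # median ~1001, 3x ~3003 -> True
```

A static threshold would have to be set at the worst-case peak and would miss this spike at 3 a.m.; the baseline check adapts to whatever "normal" is for that minute.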

How to baseline

Datadog forecast and outlier monitors, Prometheus PromQL with `predict_linear` and quantile windows, and Nova AI Ops anomaly detection all support this natively.

Use a 14 to 28 day window. Shorter windows miss weekly seasonality; longer windows are slow to react to genuine traffic shifts.

Compare to the same time of day, same day of week. "Last 4 Tuesdays at 14:00 to 15:00" beats "last 14 days flat".
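Selecting that "same weekday, same hour" window can be sketched as follows, assuming timestamped samples in a dict (names and data are illustrative):

```python
from datetime import datetime, timedelta

def same_window_samples(samples, now, weeks=4):
    """Pick historical values from the same weekday and hour as
    `now`, over the previous `weeks` weeks.

    `samples` maps datetime -> value (e.g. requests/min).
    """
    cutoff = now - timedelta(weeks=weeks)
    return [v for ts, v in samples.items()
            if cutoff <= ts < now
            and ts.weekday() == now.weekday()
            and ts.hour == now.hour]

now = datetime(2025, 10, 21, 14, 30)   # a Tuesday, 14:30
samples = {
    datetime(2025, 10, 14, 14, 5): 100,   # last Tuesday, 14:xx -> kept
    datetime(2025, 10, 7, 14, 50): 110,   # two Tuesdays ago   -> kept
    datetime(2025, 10, 14, 9, 0): 999,    # Tuesday but 09:xx  -> dropped
    datetime(2025, 10, 15, 14, 0): 500,   # Wednesday          -> dropped
}
print(same_window_samples(samples, now))  # [100, 110]
```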

When baselining fails

First-time-of-year events. Black Friday, tax season, World Cup. The baseline has no data; the alert fires constantly.

Mitigate with hard-coded exclusions for known events: Datadog and Nova AI Ops both support seasonality overrides.

After a major change. New marketing campaign, viral mention, product launch. Reset the baseline manually; don't trust the auto-fit.

The tuning loop

Track false-positive rate per baselined alert. Above 10% means the window or sensitivity is wrong.
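Tracking that threshold is simple bookkeeping; a sketch, where the 10% cutoff follows the text and function names are mine:

```python
def false_positive_rate(fired, actionable):
    """Fraction of fired alerts that needed no action."""
    if fired == 0:
        return 0.0
    return (fired - actionable) / fired

def needs_retune(fired, actionable, threshold=0.10):
    # Above 10% false positives: widen the window or lower sensitivity.
    return false_positive_rate(fired, actionable) > threshold

# 40 alerts fired last month, 33 led to real action: 17.5% FP rate.
print(needs_retune(fired=40, actionable=33))  # -> True
```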

Track false-negative rate. Run synthetic chaos: inject a controlled anomaly and confirm the alert fires within 2 minutes.

Review monthly. Baselines drift; tuning is permanent overhead, not one-time work.

Pick by signal type

Traffic, request rate, error count: baseline.

Disk, memory, connection pool, queue depth: static.

Latency: hybrid. Baseline for slow drift, static for SLO breaches.
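The hybrid latency rule could be sketched like this; the 400 ms SLO ceiling and 2x drift factor are assumed values for illustration, not from the article:

```python
from statistics import median

SLO_P99_MS = 400.0  # static ceiling: assumed SLO, not from the article

def latency_alert(current_p99_ms, baseline_p99s, drift_factor=2.0):
    """Hybrid check: a static threshold catches SLO breaches,
    a baseline comparison catches slow drift below the SLO."""
    if current_p99_ms > SLO_P99_MS:
        return "slo_breach"
    if baseline_p99s and current_p99_ms > drift_factor * median(baseline_p99s):
        return "drift"
    return None

history = [80, 85, 82, 90, 88, 84, 86]  # p99 ms, same window, last 7 days
print(latency_alert(450, history))  # over 400 ms -> slo_breach
print(latency_alert(200, history))  # 2x median of ~85 ms -> drift
print(latency_alert(120, history))  # within normal range -> None
```

A purely static check would miss the 200 ms case, a 2.4x regression that is still well under the SLO; a purely baselined check would page at different absolute latencies week to week.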