Alert on Anomalies, Not Norms

Don't alert on normal behaviour. Alert on deviations.

The rule

The rule is uncompromising. “Service is up” is not an alert; “service is down” is. Many teams set fixed thresholds without checking whether crossing them signals a genuine anomaly or just ordinary daily traffic. Anomaly-based alerts typically produce 5-10x less noise than threshold-based alerts on the same signal.

Detecting normal

Detecting normal is statistical. Pull 14 days of data and compute p5, median, p95, and p99 per minute-of-day; plot the distribution, because the p5-p95 band is “normal” and anything outside it is candidate alert territory; use the same distribution to set alert thresholds (p99 plus 2 stddev is a defensible upper bound).
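
The baseline computation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (build_baseline, is_candidate), the input shape (pairs of minute-of-day and value), and the choice of population standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import pstdev, quantiles


def build_baseline(samples):
    """Build per-minute-of-day bands from (minute_of_day, value) pairs.

    Assumes roughly 14 days of history, i.e. about 14 values per minute.
    """
    by_minute = defaultdict(list)
    for minute_of_day, value in samples:
        by_minute[minute_of_day].append(value)

    baseline = {}
    for minute, values in by_minute.items():
        if len(values) < 2:
            continue  # quantiles() needs at least two data points
        # 99 cut points: index 4 is p5, 49 is the median, 94 is p95, 98 is p99
        cuts = quantiles(values, n=100, method="inclusive")
        baseline[minute] = {
            "p5": cuts[4],
            "p50": cuts[49],
            "p95": cuts[94],
            "p99": cuts[98],
            # upper bound suggested in the text: p99 plus 2 standard deviations
            "upper": cuts[98] + 2 * pstdev(values),
        }
    return baseline


def is_candidate(baseline, minute_of_day, value):
    """True when a reading falls outside the p5-p95 "normal" band."""
    band = baseline[minute_of_day]
    return value < band["p5"] or value > band["p95"]
```

With only 14 points per minute the tail percentiles are interpolated from very few samples, so treat p99 as a rough estimate; pooling neighbouring minutes is one way to stabilise it.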

Detecting anomalies

Anomaly detection has a hierarchy. Statistical methods come first (Holt-Winters, Prophet, and Datadog seasonal monitors handle 80% of cases); ML models come second, and only if the signal has hidden structure (multivariate or conditional behaviour, which most signals don’t have); always include a rate-of-change check, because a signal that moves from p50 to p99 in 2 minutes is anomalous even if p99 is “normal”.
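
The rate-of-change check can be illustrated with a small Python sketch. The function name rapid_shift, its parameters, and the rule of comparing the 2-minute change against the p50-to-p99 span are assumptions chosen for the example; a real monitor would likely use the monitoring platform's own derivative or change functions.

```python
def rapid_shift(readings, p50, p99, window=2, fraction=1.0):
    """Flag a fast move across the baseline span.

    readings: per-minute values, oldest first.
    Returns True when the change over the last `window` minutes is at
    least `fraction` of the distance between p50 and p99.
    """
    span = p99 - p50
    if span <= 0 or len(readings) <= window:
        return False  # degenerate baseline or not enough history
    change = abs(readings[-1] - readings[-1 - window])
    return change >= fraction * span
```

For example, with p50 = 100 and p99 = 180, the series 100, 101, 180 is flagged because it covers the full span in 2 minutes, while a climb of 100, 120, 140, 160, 180 is not, even though it ends at the same value.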

Common traps

Three traps catch teams. The first is alerting on absolute zero traffic: zero is not anomalous on Saturday night for an internal admin tool, so use rate-of-change instead. The second is alerting on a single high reading: one-minute spikes are noise, while 3 consecutive minutes is signal. The third is forgetting to update baselines after launches: post-launch traffic doubles the baseline, and old thresholds become permanent false positives.
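
The second trap, single high readings, is the easiest to guard against in code. The sketch below assumes per-minute readings in a list and uses a hypothetical helper name, sustained_breach; it fires only when the last 3 readings all exceed the threshold.

```python
def sustained_breach(readings, threshold, required=3):
    """True only when the last `required` per-minute readings all exceed
    the threshold, so a one-minute spike never fires on its own."""
    if len(readings) < required:
        return False
    return all(value > threshold for value in readings[-required:])
```

A single spike such as 10, 99, 10, 10 against a threshold of 50 stays quiet, while 10, 60, 70, 80 fires.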

How to apply

The application is iterative. Audit your noisiest 5 alerts and replace fixed thresholds with anomaly-based logic; test in shadow mode for 2 weeks and promote when the false-positive rate is under 5%; repeat for the next 5 because the noise budget shrinks each cycle.
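
The promotion gate can be sketched as follows. This assumes each shadow-mode firing has been labelled by hand as matching a real incident or not, and that "false-positive rate" means the share of firings with no matching incident; both the labelling scheme and the function names are assumptions for illustration.

```python
def shadow_false_positive_rate(firings):
    """firings: list of bools, True when a shadow firing matched a real
    incident. Returns the share of firings that did not, or None when
    the alert never fired during the shadow period."""
    if not firings:
        return None
    false_positives = sum(1 for matched in firings if not matched)
    return false_positives / len(firings)


def ready_to_promote(firings, max_rate=0.05):
    """Promote only when the alert fired at least once and its
    false-positive rate is under the limit (5% by default)."""
    rate = shadow_false_positive_rate(firings)
    return rate is not None and rate < max_rate
```

An alert that never fires in 2 weeks is left unpromoted here on purpose: no firings means no evidence either way, so it is worth checking the logic against a past incident before trusting it.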