Alerts Practical
By Samson Tanimawo, PhD · Published Oct 30, 2025

Alert on Anomalies, Not Norms

Don't alert on normal behaviour. Alert on deviations.

The rule

Don't alert on what is normal. Alert on what is unusual. "Service is up" is not an alert; "service is down" is.

Many teams set alerts on healthy-state thresholds (request rate above X, queue length above Y) without considering whether X and Y represent anomalies or daily traffic.

Anomaly-based alerts typically produce 5–10× less noise than fixed-threshold alerts on the same signal.

Detecting normal

Pull 14 days of data. Compute the median, p95, and p99 per minute-of-day.

Plot the distribution. The band from p5 to p95 is "normal"; anything outside it is candidate alert territory.

Use this same distribution to set alert thresholds. p99 plus 2 stddev is a defensible upper bound; below that is normal.
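The steps above can be sketched with the standard library. The function names and the `(minute_of_day, value)` input shape are illustrative assumptions, not a prescribed API:

```python
import statistics
from collections import defaultdict

def baseline_by_minute(samples):
    """Build a per-minute-of-day baseline from ~14 days of data.
    samples: iterable of (minute_of_day, value) pairs.
    Returns {minute: (median, p95, p99)}."""
    by_minute = defaultdict(list)
    for minute, value in samples:
        by_minute[minute].append(value)
    baseline = {}
    for minute, values in by_minute.items():
        q = statistics.quantiles(values, n=100)  # q[k-1] ~ k-th percentile
        baseline[minute] = (statistics.median(values), q[94], q[98])
    return baseline

def alert_threshold(values):
    """The upper bound suggested above: p99 plus 2 stddev."""
    q = statistics.quantiles(values, n=100)
    return q[98] + 2 * statistics.stdev(values)
```

With 14 days of one-minute samples, each minute-of-day gets 14 readings; more history tightens the percentile estimates.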

Detecting anomalies

Statistical methods first. Holt-Winters, Prophet, and Datadog's seasonal monitors handle roughly 80% of cases.
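A full Holt-Winters fit belongs to a library, but the core seasonal idea, judging each point against the same minute on prior days, fits in a few lines. This is a deliberately simplified seasonal z-score sketch, not Holt-Winters itself; the season length and z cutoff are assumptions:

```python
import statistics

def seasonal_anomaly(history, value, minute, season=1440, z=3.0):
    """Compare `value` to readings at the same minute-of-day in prior days.
    history: one reading per minute, oldest first (at least two full days).
    Returns True when value sits more than z stddevs from the seasonal mean."""
    peers = history[minute::season]  # same minute in each recorded day
    if len(peers) < 2:
        return False                 # not enough seasons to judge
    mean, sd = statistics.fmean(peers), statistics.stdev(peers)
    if sd == 0:
        return value != mean
    return abs(value - mean) / sd > z
```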

ML models second, only if the signal has hidden structure (multivariate, conditional). Most signals don't.

Always include a rate-of-change check. A signal that moves from p50 to p99 in 2 minutes is anomalous even if p99 is in the "normal" band.
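The rate-of-change check can be sketched like this (the window shape and function name are assumptions for illustration):

```python
def fast_climb(window, p50, p99, minutes=2):
    """True if the signal moved from at or below p50 to at or above p99
    within `minutes` -- anomalous even when p99 is inside the normal band.
    window: per-minute readings, oldest first."""
    if len(window) < minutes + 1:
        return False  # not enough history to measure the climb
    return window[-(minutes + 1)] <= p50 and window[-1] >= p99
```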

Common traps

Alerting on absolute zero traffic. Zero is not anomalous on a Saturday night for an internal admin tool. Use rate-of-change.

Alerting on a single high reading. One-minute spikes are noise; 3 consecutive minutes is signal.
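The "3 consecutive minutes" rule is a small state machine. The class name and API below are illustrative:

```python
class Debounce:
    """Suppress one-minute spikes: fire only after `required`
    consecutive breaching readings."""

    def __init__(self, required=3):
        self.required = required
        self.streak = 0

    def observe(self, breached):
        """Feed one minute's breach status; returns whether to page."""
        self.streak = self.streak + 1 if breached else 0
        return self.streak >= self.required
```

A single False reading resets the streak, so isolated spikes never page.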

Forgetting to update baselines after launches. If a launch doubles traffic, old thresholds sit below the new normal and become permanent false positives.

How to apply

Audit your noisiest 5 alerts. Replace fixed thresholds with anomaly-based logic.

Test in shadow mode for 2 weeks. Promote when the false-positive rate is under 5%.
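Promotion out of shadow mode hinges on one number. The 5% cutoff comes from the text; the function and its counters are illustrative:

```python
def ready_to_promote(shadow_fires, confirmed_real, max_fp_rate=0.05):
    """A shadow alert that fired but wasn't confirmed real counts as a
    false positive. Promote when the FP rate is under max_fp_rate."""
    if shadow_fires == 0:
        return False  # no data yet; keep shadowing
    fp_rate = (shadow_fires - confirmed_real) / shadow_fires
    return fp_rate < max_fp_rate
```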

Repeat for the next 5. Rinse, repeat. The noise budget shrinks each cycle.