Alert Tuning Cadence
Alerts rot. Tune them on a schedule.
Alerts need scheduled tuning
Threshold drift is real. Traffic grows, code changes, dependencies are swapped. Last year's threshold becomes this year's noise.
Schedule tuning. Quarterly is the floor; monthly for high-traffic teams.
Without a cadence, tuning happens reactively after a noisy weekend. Reactive tuning is worse than proactive: it skips the analysis.
Data-driven tuning
Pull 30-day metrics. Compute new threshold candidates: 99th percentile, 95th percentile, 3-sigma above mean.
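A minimal sketch of the candidate computation, assuming the 30-day window is already loaded as a NumPy array of samples (the names are placeholders):

```python
import numpy as np

def threshold_candidates(samples: np.ndarray) -> dict[str, float]:
    """Candidate thresholds from a 30-day window of metric samples."""
    return {
        "p99": float(np.percentile(samples, 99)),
        "p95": float(np.percentile(samples, 95)),
        "mean_plus_3sigma": float(samples.mean() + 3 * samples.std()),
    }
```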
Backtest each candidate against the last 90 days. How many alerts would have fired? How many real incidents would have been caught?
Pick the threshold that maximises true positives and caps false positives at the team's noise budget (e.g., 1 false positive per week per alert).
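A rough backtest sketch, assuming plain lists of samples, timestamps, and incident times pulled from the postmortem log; the firing logic is simplified to a single-sample threshold crossing, and all names are placeholders:

```python
def backtest(samples, timestamps, incident_times, threshold, window=300):
    """Count firings and caught incidents for one candidate threshold.

    samples/timestamps: 90 days of metric values and their unix times.
    incident_times: unix times of real incidents.
    An incident is 'caught' if a firing lands within `window` seconds before it.
    """
    firing_times = [t for t, v in zip(timestamps, samples) if v > threshold]
    caught = sum(
        any(0 <= inc - t <= window for t in firing_times) for inc in incident_times
    )
    false_positives = sum(
        not any(0 <= inc - t <= window for inc in incident_times) for t in firing_times
    )
    return {"fired": len(firing_times), "caught": caught, "false_positives": false_positives}

def pick_threshold(candidates, samples, timestamps, incident_times, fp_budget_per_week=1):
    """Pick the candidate that catches the most incidents within the noise budget."""
    weeks = (timestamps[-1] - timestamps[0]) / (7 * 86400)
    best = None
    for name, thr in candidates.items():
        result = backtest(samples, timestamps, incident_times, thr)
        if result["false_positives"] / weeks > fp_budget_per_week:
            continue  # over the noise budget, skip this candidate
        if best is None or result["caught"] > best[1]["caught"]:
            best = (name, result, thr)
    return best
```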
Automate where possible
Tools like Datadog Watchdog, Prometheus's `predict_linear`, and SLO burn-rate alerts adapt to changing traffic.
Adaptive alerts replace fixed thresholds for most metrics. Keep fixed thresholds for explicit SLA values (e.g., 99.9% availability).
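For the burn-rate case, a sketch of the arithmetic, assuming a 99.9% SLO over a 30-day window; the 14.4 paging threshold is the standard "2% of budget in one hour" value, and the error-rate inputs are assumed to come from your metrics backend:

```python
def burn_rate(error_rate: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    error_budget = 1.0 - slo_target  # 0.1% for a 99.9% SLO
    return error_rate / error_budget

def should_page(error_rate_1h: float, error_rate_5m: float) -> bool:
    """Multiwindow check: page only when both the long and short window burn fast.

    14.4 is the burn rate that spends 2% of a 30-day budget in one hour
    (0.02 * 720 hours = 14.4), a common paging threshold.
    """
    return burn_rate(error_rate_1h) > 14.4 and burn_rate(error_rate_5m) > 14.4
```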
Don't fully automate. Keep a human in the loop for threshold changes that affect paging.
Tuning review checklist
Did the alert fire in the last quarter? If not, why is it still here.
Of the fires, how many were actioned? An action rate below 50% means the threshold is wrong; see the sketch after this checklist.
Has the underlying service changed shape (traffic, dependencies, code)? If yes, the threshold is suspect.
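A sketch automating the first two checks, assuming a hypothetical export of each alert's firing history with an `actioned` flag:

```python
from datetime import datetime, timedelta

def audit_alert(firings: list[dict], now: datetime | None = None) -> dict:
    """First two checklist checks for one alert.

    `firings` is a hypothetical history export, one dict per fire with an
    'at' datetime and an 'actioned' bool (did on-call do anything with it).
    """
    now = now or datetime.utcnow()
    quarter_ago = now - timedelta(days=90)
    recent = [f for f in firings if f["at"] >= quarter_ago]
    actioned = sum(1 for f in recent if f["actioned"])
    action_rate = actioned / len(recent) if recent else None
    return {
        "fired_last_quarter": bool(recent),  # if False: why is it still here?
        "action_rate": action_rate,          # below 0.5: threshold is wrong
        "needs_tuning": not recent or (action_rate is not None and action_rate < 0.5),
    }
```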
Where to start
Pick the 5 noisiest alerts this quarter. Tune them with backtested thresholds.
Move 3 fixed-threshold alerts to SLO burn-rate alerts.
Add tuning to the on-call review meeting. 15 minutes per alert; close the loop the same day.