Alert Investment Priorities

Where to invest alert engineering time. The ROI ranking.

The question

Alert engineering time is finite. Where does the next hour of tuning produce the biggest return? Most teams default to whichever alert fired last night, which is reactive rather than optimal. Rank alerts by page volume × severity × customer impact instead, because the top 3 alerts usually drive 60% of on-call pain.

Finite engineering time. The next hour of tuning has an opportunity cost; pick the highest-return target.
Default is reactive. Teams tune whichever alert fired last night, which is rarely the optimal target.
Rank by volume × severity × impact. The composite metric that captures real pain.
Top 3 drive 60% of the pain. Focus the investment where the return is biggest.
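The composite ranking above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the AlertStats fields, the 1 to 4 severity scale, and the 0.0 to 1.0 customer-impact weight are all assumptions chosen for the example, and any consistent scales would work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertStats:
    """Per-alert numbers for one review window (all fields are hypothetical)."""
    name: str
    pages: int              # pages sent in the window
    severity: int           # assumed scale: 1 (low) to 4 (critical)
    customer_impact: float  # assumed weight: 0.0 (none) to 1.0 (full outage)


def pain_score(alert: AlertStats) -> float:
    """Composite score: volume x severity x customer impact."""
    return alert.pages * alert.severity * alert.customer_impact


def rank_alerts(alerts: List[AlertStats]) -> List[AlertStats]:
    """Return alerts sorted from highest to lowest pain score."""
    return sorted(alerts, key=pain_score, reverse=True)


def top_share(alerts: List[AlertStats], n: int = 3) -> float:
    """Fraction of total pain contributed by the n highest-scoring alerts."""
    ranked = rank_alerts(alerts)
    total = sum(pain_score(a) for a in ranked)
    if total == 0:
        return 0.0
    return sum(pain_score(a) for a in ranked[:n]) / total
```

Running top_share on real paging data is a quick way to check whether the "top 3 drive 60%" pattern holds for a given team before committing tuning hours.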

Priority ranking

Three tiers cover most cases. Tier 1: alerts that page more than once per shift on average (fix these first; they are the noise that destroys rotations). Tier 2: alerts with a low MTTA (mean time to acknowledge) but no clear customer impact (they waste cognitive cycles even when acknowledged quickly). Tier 3: alerts that fire monthly but always require novel debugging (invest in runbooks, not detection).

Tier 1: more than once per shift. Fix first; the noise that destroys rotations.
Tier 2: low MTTA, no impact. Wastes cognitive cycles even when fast to ack.
Tier 3: monthly novel debugging. Invest in runbooks, not detection.
Per-tier investment ratio. Tier 1 gets most of the time; Tiers 2 and 3 get effort in proportion to the pain they cause.
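As a rough sketch, the three tiers can be expressed as a small classifier. The field names and both cutoffs (FAST_ACK_MINUTES for "low MTTA", MONTHLY_MAX_FIRES for "fires monthly") are assumptions made for illustration; the text does not specify exact values.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cutoffs; tune them to the team's own data.
FAST_ACK_MINUTES = 5.0    # "low MTTA" means acknowledged within this many minutes
MONTHLY_MAX_FIRES = 1.0   # "fires monthly" means at most this many fires per month


@dataclass
class AlertProfile:
    """Summary statistics for one alert (all fields are hypothetical)."""
    name: str
    pages_per_shift: float       # average pages per on-call shift
    mtta_minutes: float          # mean time to acknowledge
    has_customer_impact: bool    # does a firing alert map to customer pain?
    fires_per_month: float
    needs_novel_debugging: bool  # does each firing require fresh investigation?


def classify_tier(alert: AlertProfile) -> Optional[int]:
    """Return 1, 2, or 3 for the matching tier, or None if no tier applies."""
    # Tier 1: pages more than once per shift on average. Checked first.
    if alert.pages_per_shift > 1:
        return 1
    # Tier 2: quickly acknowledged, but no clear customer impact.
    if alert.mtta_minutes <= FAST_ACK_MINUTES and not alert.has_customer_impact:
        return 2
    # Tier 3: rare, but each firing needs novel debugging (invest in runbooks).
    if alert.fires_per_month <= MONTHLY_MAX_FIRES and alert.needs_novel_debugging:
        return 3
    return None
```

The checks run in tier order, so an alert that qualifies for more than one tier lands in the higher-priority one.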

Where to spend the time

Three patterns yield most return. Tighten thresholds first because 60% of noise comes from thresholds set during the panic of an old incident and never revisited; add customer-impact context next to reduce triage time even when the alert keeps firing; build dependency suppression last because it’s high leverage but expensive to set up.

Tighten thresholds first. 60% of noise comes from thresholds set in the panic of an old incident; this is the highest-return move.
Customer-impact context next. Reduces triage time even if the alert keeps firing.
Dependency suppression last. High leverage but expensive; only after the basics are clean.
Per-pattern measurement. Measure the return of each pattern so that continued investment is backed by data.
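One simple way to measure a pattern's return is to compare the weekly page rate before and after the change. The sketch below assumes page counts are already available for both windows; the function names and the choice of pages per week as the unit are illustrative, not taken from any particular tool.

```python
def pages_per_week(page_count: int, window_days: int) -> float:
    """Normalize a page count over a window of any length to a weekly rate."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return page_count * 7 / window_days


def relative_reduction(before_rate: float, after_rate: float) -> float:
    """Fraction by which the page rate dropped (negative if it got worse)."""
    if before_rate == 0:
        return 0.0  # nothing to reduce; avoid dividing by zero
    return (before_rate - after_rate) / before_rate


# Example with made-up numbers: 60 pages in the 30 days before a threshold
# change, 7 pages in the 14 days after it.
before = pages_per_week(60, 30)
after = pages_per_week(7, 14)
print(f"before={before:.1f}/week after={after:.1f}/week "
      f"reduction={relative_reduction(before, after):.0%}")
```

Normalizing to a weekly rate lets a 30-day baseline be compared fairly with a shorter follow-up window.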

Where not to spend

Three traps waste alert engineering time. ML anomaly detection on a fundamentally noisy signal (cleaning the signal source pays better than smarter detection); custom dashboards for alerts that are about to be deleted (don’t gold-plate doomed pages); tooling that requires a vendor migration to deliver (migration cost almost always exceeds the alert ROI).

ML on noisy signal. Cleaning the source pays better than smarter detection.
Dashboards for doomed alerts. Don’t gold-plate pages about to be deleted.
Vendor-migration tooling. Migration cost almost always exceeds the alert ROI.
Per-trap awareness. Keep the traps documented in the alert engineering handbook so the team can recognize and avoid them.

Apply this week

The application is concrete. Pull the last 30 days of pages and rank them by volume. Pick the top 3 and allocate 4 hours to each: tune the threshold, add impact text, link the runbook, then measure for 2 weeks. Repeat monthly, because the top 3 rotates but the discipline does not.

30-day page pull. Rank by volume; pick the top 3.
4 hours per alert. Tune threshold, add impact text, link runbook; measure for 2 weeks.
Monthly repeat. Top 3 rotates; the discipline does not.
Per-month investment record. Document what was invested in each cycle to keep the team accountable.
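The monthly cycle can be sketched as a short script. It assumes pages have been exported as (alert_name, fired_at) tuples, which is a hypothetical format; a real paging tool's export will differ and needs adapting first.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Page = Tuple[str, datetime]  # hypothetical export row: (alert_name, fired_at)


def top_alerts_by_volume(
    pages: Iterable[Page],
    now: datetime,
    window_days: int = 30,
    top_n: int = 3,
) -> List[Tuple[str, int]]:
    """Count pages per alert inside the window and return the top_n by volume."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        name for name, fired_at in pages if cutoff <= fired_at <= now
    )
    return counts.most_common(top_n)


def tuning_plan(
    top: List[Tuple[str, int]], hours_per_alert: int = 4
) -> Dict[str, int]:
    """Allocate a fixed block of tuning hours to each selected alert."""
    return {name: hours_per_alert for name, _count in top}
```

With the defaults, one cycle budgets 12 hours in total (3 alerts × 4 hours). Saving each month's plan next to the measured results gives the per-cycle record described above.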