Alert Cost Tracking

Each alert has a cost. Track it.

What an alert costs

An alert costs more than the per-rule fee. The direct cost is the SaaS fee from vendors such as Datadog, New Relic, or PagerDuty: where pricing is usage-based, a noisy rule firing 1000 times a month consumes events you pay for. The indirect cost is pager interruption time: a single page costs roughly 25 minutes of focused work, even when the page turns out to be noise. The capacity cost is alert evaluation load: high-cardinality rules evaluated at 1m intervals burn CPU continuously.
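To make the indirect cost concrete, here is a small worked calculation. It uses the ~25 minutes per page quoted above; the page count of 40 is a made-up input, not a measurement.

```python
# Rough figure from the text: focused work lost per page, in minutes.
FOCUS_MINUTES_PER_PAGE = 25


def interruption_hours(pages: int, minutes_per_page: float = FOCUS_MINUTES_PER_PAGE) -> float:
    """Convert a number of pages into hours of focused work lost."""
    return pages * minutes_per_page / 60


# Hypothetical month: one rotation receives 40 pages.
hours = interruption_hours(40)
print(f"{hours:.1f} focused hours lost")
```

Even a modest page count adds up to days of lost engineering time per month, which is why the per-rule fee understates the real cost.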

How to track

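Tracking is mechanical. Tag every rule with owner_team, service, and tier. Collect alert fires from Alertmanager's /api/v2/alerts endpoint and push them to a metrics backend (Datadog and PagerDuty expose similar APIs); the endpoint reports the alerts Alertmanager currently holds rather than a history, so poll it on a schedule and count new start times. Aggregate fire counts by tag weekly. Compute each rule's daily cost as (vendor_fee + fires * on_call_minutes * minute_rate) / 30, where fires is the monthly fire count, on_call_minutes is the interruption time per page, and minute_rate is a dollar rate you choose to convert on-call minutes into money. Publish a top-10 list of the most expensive rules.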
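The aggregation and cost steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the alert dictionaries mimic the labels, startsAt, and fingerprint fields of an Alertmanager /api/v2/alerts response, the sample data is invented, and minute_rate (dollars per on-call minute) is a hypothetical parameter added so that interruption minutes and vendor fees share a unit. No network call is made; fetching and scheduling are left out.

```python
from collections import Counter

TAGS = ("owner_team", "service", "tier")


def count_fires(alerts, seen=None):
    """Count fires per (alertname, owner_team, service, tier).

    `seen` remembers (fingerprint, startsAt) pairs so that the same fire
    observed on several polls is counted once.
    """
    seen = set() if seen is None else seen
    counts = Counter()
    for alert in alerts:
        key = (alert["fingerprint"], alert["startsAt"])
        if key in seen:
            continue
        seen.add(key)
        labels = alert["labels"]
        rule = (labels.get("alertname", "unknown"),) + tuple(
            labels.get(tag, "untagged") for tag in TAGS
        )
        counts[rule] += 1
    return counts


def daily_cost(fires, vendor_fee, on_call_minutes=25.0, minute_rate=1.5):
    """Daily cost of one rule; minute_rate is an assumed dollar rate."""
    return (vendor_fee + fires * on_call_minutes * minute_rate) / 30


def top_expensive(monthly_fires, vendor_fees, n=10):
    """Return the n most expensive rules as (rule, daily_cost) pairs."""
    costs = {
        rule: daily_cost(fires, vendor_fees.get(rule, 0.0))
        for rule, fires in monthly_fires.items()
    }
    return sorted(costs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Invented sample: the first two entries are the same fire seen on two polls.
latency_labels = {"alertname": "HighLatency", "owner_team": "payments",
                  "service": "checkout", "tier": "1"}
sample = [
    {"fingerprint": "a1", "startsAt": "2024-05-01T10:00:00Z", "labels": latency_labels},
    {"fingerprint": "a1", "startsAt": "2024-05-01T10:00:00Z", "labels": latency_labels},
    {"fingerprint": "a1", "startsAt": "2024-05-02T09:30:00Z", "labels": latency_labels},
    {"fingerprint": "b2", "startsAt": "2024-05-01T11:00:00Z",
     "labels": {"alertname": "DiskFull", "service": "db"}},
]

counts = count_fires(sample)
for rule, cost in top_expensive(counts, vendor_fees={}):
    print(rule, round(cost, 2))
```

Rules missing a tag are grouped under "untagged" rather than dropped, so gaps in tagging show up in the report instead of disappearing from it.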

Acting on the data

Action turns the data into outcomes. The highest-cost rule each month gets a mandatory review that ends in a fix or a deletion. Tie alert cost to team budgets: SRE pays the bill only until the owning team takes ownership of the rule. Reject new rules from teams whose existing rules are in the top quintile by cost.
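The top-quintile gate can be sketched as a simple check. The function names, the rule-to-owner mapping, and the sample costs below are all hypothetical; the sketch assumes "top quintile" means the most expensive 20% of rules in the catalog, with at least one rule always included.

```python
def top_quintile_rules(costs):
    """Return the rules in the most expensive 20% of the catalog."""
    if not costs:
        return set()
    ranked = sorted(costs, key=costs.get, reverse=True)
    cutoff = max(1, len(ranked) // 5)
    return set(ranked[:cutoff])


def may_add_rule(team, costs, owners):
    """A team may add a rule only if it owns no top-quintile rule."""
    expensive = top_quintile_rules(costs)
    return not any(owners.get(rule) == team for rule in expensive)


# Invented catalog of ten rules: r0 is the most expensive, r9 the cheapest.
costs = {f"r{i}": 10.0 - i for i in range(10)}
owners = {rule: "platform" for rule in costs}
owners["r0"] = "payments"
owners["r1"] = "search"

print(may_add_rule("payments", costs, owners))
print(may_add_rule("platform", costs, owners))
```

A check like this could run in the review step for new rule definitions, so the rejection is automatic rather than a judgment call.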

Realistic savings

The savings are predictable. A 200-rule catalog typically has 15-25 rules that account for over half of all alert noise. Removing those drops vendor event counts by 40-60%, which lowers the bill wherever the vendor charges by event volume. On-call satisfaction scores rise within one rotation cycle as pages drop, and the cost-per-page metric makes the trade visible to leadership.
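To check whether a catalog shows this kind of concentration, one option is to find the smallest set of rules that accounts for more than half of all fires. The sketch below assumes fire counts are already aggregated per rule; the five-rule sample is invented and much smaller than a real catalog.

```python
def rules_covering(fire_counts, share=0.5):
    """Smallest set of noisiest rules whose fires exceed `share` of the total."""
    total = sum(fire_counts.values())
    if total == 0:
        return []
    picked, running = [], 0
    ranked = sorted(fire_counts.items(), key=lambda kv: kv[1], reverse=True)
    for rule, fires in ranked:
        picked.append(rule)
        running += fires
        if running > share * total:
            break
    return picked


def reduction_if_removed(fire_counts, rules):
    """Fraction of all fires that would disappear if `rules` were removed."""
    total = sum(fire_counts.values())
    if total == 0:
        return 0.0
    return sum(fire_counts[rule] for rule in rules) / total


# Invented monthly fire counts.
fires = {"a": 500, "b": 300, "c": 100, "d": 60, "e": 40}
noisy = rules_covering(fires)
print(noisy, reduction_if_removed(fires, noisy))
```

Running this on real counts shows how many rules sit behind most of the noise and what share of events removing them would eliminate.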

Start small

The starter ramp is concrete. Week 1: collect fire counts and take no action. Week 2: publish the top-20 list to engineering. Week 3: delete or fix the top 5. Skip vendor-supplied noise-reduction features until the catalog is clean, because they mask the problem rather than solve it. Make the fire-count dashboard public, because visibility is the cheapest intervention.