Alert Cost Tracking
Each alert has a cost. Track it.
What an alert costs
An alert costs more than the per-rule fee. The direct cost is SaaS per-rule fees from Datadog, New Relic, and PagerDuty: a noisy rule firing 1000 times a month consumes events you pay for. The indirect cost is pager interruption time: a single page costs about 25 minutes of focused work, even when the page is noise. The capacity cost is alert evaluation load: high-cardinality rules evaluated at 1m intervals burn CPU continuously.
- Direct vendor cost. SaaS per-rule fees; Datadog, New Relic, PagerDuty; events you pay for.
- Indirect interruption cost. About 25 minutes of focused work per page, even when the page is noise.
- Capacity evaluation cost. High-cardinality rules with 1m intervals burn CPU continuously.
- Per-rule TCO. Vendor plus interruption plus capacity; the full cost picture.
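The per-rule TCO above can be sketched as a small calculation. This is a minimal illustration, not a definitive model: the hourly engineer cost and the CPU rate are hypothetical placeholders, and only the 25-minute figure comes from the text. The names `RuleUsage` and `monthly_tco` are invented for this example.

```python
from dataclasses import dataclass

# Rates used only for illustration; replace them with your own figures.
MINUTES_LOST_PER_PAGE = 25       # from the text: ~25 minutes of focused work per page
ENGINEER_COST_PER_HOUR = 100.0   # assumption: loaded hourly cost of an engineer
CPU_COST_PER_CORE_HOUR = 0.04    # assumption: infrastructure rate per core-hour


@dataclass
class RuleUsage:
    """One month of usage for a single alert rule (hypothetical shape)."""
    vendor_fee: float       # dollars the vendor charges for this rule's events
    pages: int              # pages that reached a human
    cpu_core_hours: float   # evaluation load attributed to the rule


def monthly_tco(rule: RuleUsage) -> dict:
    """Return vendor, interruption, and capacity cost plus their total."""
    interruption = rule.pages * MINUTES_LOST_PER_PAGE / 60 * ENGINEER_COST_PER_HOUR
    capacity = rule.cpu_core_hours * CPU_COST_PER_CORE_HOUR
    return {
        "vendor": rule.vendor_fee,
        "interruption": interruption,
        "capacity": capacity,
        "total": rule.vendor_fee + interruption + capacity,
    }


if __name__ == "__main__":
    noisy = RuleUsage(vendor_fee=50.0, pages=12, cpu_core_hours=100.0)
    print(monthly_tco(noisy))
```

With rates in this range, the interruption term tends to dominate the total, which is the point of counting more than the vendor fee.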
How to track
Tracking is mechanical. Tag every rule with owner_team, service, and tier, and aggregate fire counts by tag weekly. Pull alert fires from Alertmanager's /api/v2/alerts endpoint and push them to a metrics backend (Datadog and PagerDuty expose similar APIs). Compute cost per alert as ($vendor_fee + $on_call_minutes * fires) / 30 days, which turns monthly totals into a daily figure, and publish a top-10 most-expensive list.
- Per-rule tags. owner_team, service, tier; the basic attribution surface.
- Weekly fire-count aggregation. By tag; the cadence that catches drift.
- API-driven pull. Alertmanager /api/v2/alerts, Datadog, PagerDuty; push to metrics backend.
- Cost formula. ($vendor_fee + $on_call_minutes * fires) / 30 days; publish top-10.
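The tracking steps can be sketched in a few functions. This is a sketch under stated assumptions:

- The input is a list of alert objects as returned by Alertmanager's /api/v2/alerts, collected over a month by some polling job that is not shown.
- The owner_team, service, and tier tags are assumed to appear as alert labels.
- `$on_call_minutes` is read as the dollar cost of the on-call time one fire consumes.
- The function names and the `vendor_fees` mapping are hypothetical.

Because the endpoint reports currently active alerts, repeated polls return the same alert more than once. The sketch deduplicates on fingerprint and startsAt.

```python
from collections import Counter

TAGS = ("owner_team", "service", "tier")


def count_fires(alerts):
    """Count distinct fires per (alertname, owner_team, service, tier)."""
    seen = set()
    counts = Counter()
    for alert in alerts:
        # The same active alert shows up in every poll; count it once.
        identity = (alert.get("fingerprint"), alert.get("startsAt"))
        if identity in seen:
            continue
        seen.add(identity)
        labels = alert.get("labels", {})
        key = (labels.get("alertname", "unknown"),) + tuple(
            labels.get(tag, "untagged") for tag in TAGS
        )
        counts[key] += 1
    return counts


def daily_cost(vendor_fee, on_call_cost_per_fire, fires, days=30):
    """The formula from the text: (vendor fee + on-call cost * fires) / 30 days."""
    return (vendor_fee + on_call_cost_per_fire * fires) / days


def top_expensive(counts, vendor_fees, on_call_cost_per_fire, n=10):
    """Rank rules by daily cost; vendor_fees maps alertname to a monthly fee."""
    rows = [
        (key, daily_cost(vendor_fees.get(key[0], 0.0), on_call_cost_per_fire, fires))
        for key, fires in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]
```

Rules that show up as "untagged" in the output are themselves a finding: they have no owner to send the bill to.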
Acting on the data
Action turns the data into outcomes. The highest-cost rule each month gets a mandatory review (fix or delete). Tie alert cost to team budgets so SRE pays the bill until the owning team takes ownership. Reject new rules from teams whose existing rules are in the top quintile by cost.
- Monthly highest-cost review. Mandatory; fix the underlying issue or delete the rule.
- Cost-to-team-budget tie. SRE pays the bill until the owning team takes ownership.
- Reject new rules from top-quintile teams. Teams whose existing rules rank in the top quintile by cost must clean up before adding more.
- Per-action accountability. Each acted-on rule has a documented outcome; supports follow-through.
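The top-quintile gate can be expressed as a simple check, for example in whatever review step accepts new rules. This is a minimal sketch: the function names and the two input mappings (rule name to monthly cost, rule name to owning team) are assumptions made for illustration.

```python
def top_quintile_rules(rule_costs):
    """Return the most expensive 20% of rules, rounded up, by monthly cost."""
    ranked = sorted(rule_costs, key=rule_costs.get, reverse=True)
    cutoff = (len(ranked) + 4) // 5  # integer ceiling of len / 5
    return set(ranked[:cutoff])


def blocked_teams(rule_costs, rule_owner):
    """Teams that own at least one top-quintile rule."""
    return {rule_owner[rule] for rule in top_quintile_rules(rule_costs)}


def may_add_rule(team, rule_costs, rule_owner):
    """True if the team owns no rule in the top quintile by cost."""
    return team not in blocked_teams(rule_costs, rule_owner)


if __name__ == "__main__":
    costs = {"HighLatency": 900.0, "DiskFull": 40.0, "QueueDepth": 35.0,
             "CertExpiry": 5.0, "JobFailed": 20.0}
    owners = {"HighLatency": "payments", "DiskFull": "infra", "QueueDepth": "search",
              "CertExpiry": "infra", "JobFailed": "search"}
    print(may_add_rule("payments", costs, owners))
    print(may_add_rule("infra", costs, owners))
```

Rounding the cutoff up means even a small catalog always has at least one rule in the top quintile, so the gate never silently disables itself.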
Realistic savings
The savings are predictable. A 200-rule catalog typically has 15-25 rules consuming over half the noise budget; removing those drops vendor event counts by 40-60% (PagerDuty and Datadog both bill events directly); on-call satisfaction scores rise within one rotation cycle because the cost-per-page metric makes the trade visible to leadership.
- 15-25 rules dominate. Out of 200; consuming over half the noise budget.
- 40-60% event reduction. Removing the dominant rules; PagerDuty and Datadog bill events directly.
- Rotation-cycle satisfaction lift. On-call satisfaction rises within one rotation; the trade is visible.
- Per-quarter savings tracked. Documented per cycle; supports continued investment.
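Whether a catalog follows this pattern can be checked directly from fire counts. The sketch below assumes a hypothetical mapping of rule name to monthly fire count. It reports the share of fires produced by the top k rules and the smallest set of rules that accounts for a given fraction of all fires.

```python
def share_of_top_rules(fire_counts, k):
    """Fraction of all fires produced by the k noisiest rules."""
    total = sum(fire_counts.values())
    if total == 0:
        return 0.0
    top = sorted(fire_counts.values(), reverse=True)[:k]
    return sum(top) / total


def rules_covering(fire_counts, fraction=0.5):
    """Smallest set of noisiest rules whose fires reach the given fraction."""
    total = sum(fire_counts.values())
    covered = 0
    chosen = []
    ranked = sorted(fire_counts.items(), key=lambda item: item[1], reverse=True)
    for name, fires in ranked:
        if covered >= fraction * total:
            break
        chosen.append(name)
        covered += fires
    return chosen
```

The share returned by `share_of_top_rules` is also a rough upper bound on the event reduction from deleting those rules outright. Fixing a rule rather than deleting it usually removes only part of its fires.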
Start small
The starter ramp is concrete. Week 1: collect fire counts, no actions. Week 2: publish the top-20 list to engineering. Week 3: delete or fix the top 5. Skip vendor-supplied noise-reduction features until the catalog is clean, because they mask the problem rather than solve it. Make the fire-count dashboard public, because visibility is the cheapest intervention.
- Week 1: collect. Fire counts; no actions; the data first.
- Week 2: publish top-20. Engineering sees the list; the visibility creates pressure.
- Week 3: delete or fix top 5. First action; visible win.
- Public dashboard. Visibility is the cheapest intervention; the discipline lives in plain sight.