Alert Dependency Fragility

Alerts depend on metrics and integrations. When those dependencies break, alerts go quiet.

Hidden alert dependencies

Every alert depends on at least three things: the metric pipeline, the evaluator, and the paging integration. Most also depend on a runbook URL, a dashboard, and a team mapping. When any of these breaks, the alert can go silent without a visible failure: Prometheus drops a target, the metric stops, and the alert never fires. Inventory the dependencies for each alert and make them explicit in the alert metadata.
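One way to make the inventory usable is to keep it as structured data and query it in reverse: given a broken dependency, which alerts are affected? The sketch below is a minimal illustration in Python. The AlertDependencies record, its field names, and the example values are all assumptions made for this example, not part of any alerting tool.

```python
from dataclasses import dataclass, fields
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AlertDependencies:
    """Hypothetical per-alert dependency record (field names are assumptions)."""
    alert: str
    metric_pipeline: str
    evaluator: str
    paging_integration: str
    runbook_url: Optional[str] = None
    dashboard_url: Optional[str] = None
    owner_team: Optional[str] = None


def alerts_depending_on(inventory: Iterable[AlertDependencies], dependency: str) -> List[str]:
    """Return the sorted names of alerts that list `dependency` in any field."""
    affected = []
    for entry in inventory:
        # Compare against every field except the alert name itself.
        values = [getattr(entry, f.name) for f in fields(entry) if f.name != "alert"]
        if dependency in values:
            affected.append(entry.alert)
    return sorted(affected)


def missing_optional_dependencies(entry: AlertDependencies) -> List[str]:
    """List the optional dependencies that were never recorded for this alert."""
    optional = ("runbook_url", "dashboard_url", "owner_team")
    return [name for name in optional if getattr(entry, name) is None]
```

With a record like this per alert, a broken exporter or a rotated paging key maps directly to a list of alerts that may have gone silent, and unrecorded dependencies show up as gaps to fill.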

Alert on absent data

Absent-data alerts are the safety net. Pair every metric-based alert with an absent-data alert (Prometheus has absent(), Datadog has the no-data state). If the metric stops flowing for 15 minutes, fire a meta-alert to the owning team. This catches dropped scrape targets and broken exporters. Without absent-data alerts, a broken pipeline silently invalidates everything downstream.
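The pairing can be generated rather than written by hand. The sketch below builds a companion rule for a Prometheus alert as a plain dictionary that follows the alerting-rule fields (alert, expr, for, labels, annotations). The function name, the NoData suffix, and the severity and team label values are assumptions for illustration; the 15-minute window comes from the text above.

```python
from typing import Dict


def absent_data_rule(alert_name: str, selector: str, owner_team: str,
                     window: str = "15m") -> Dict[str, object]:
    """Build a companion absent-data rule for an existing metric-based alert.

    `selector` is the PromQL instant-vector selector the original alert reads,
    for example 'http_requests_total{job="api"}'.
    """
    return {
        # Naming convention is an assumption: original alert name plus "NoData".
        "alert": f"{alert_name}NoData",
        # absent() returns a 1-valued series when the selector matches nothing.
        "expr": f"absent({selector})",
        # Only fire after the metric has been missing for the whole window.
        "for": window,
        "labels": {
            "severity": "ticket",  # assumed severity for meta-alerts
            "team": owner_team,
        },
        "annotations": {
            "summary": f"No data for {selector}; {alert_name} cannot fire.",
        },
    }
```

The resulting dictionaries can be serialized into a rule group alongside the original alerts, so every metric-based alert ships with its own safety net.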

Integration rot

Integrations rot. PagerDuty integration keys rotate, Slack webhooks expire, and SSO logins expire and break read-only dashboard links. Test integrations on a schedule: an alert canary covers the paging path, and health checks cover dashboard links and runbook URLs. Build a quarterly link-check job that files tickets for broken URLs.

Test alerts in CI

Alert configs in git are testable. Use Prometheus promtool test rules, terraform plan for Datadog monitors, or unit tests against PagerDuty’s API. CI catches typos in label selectors, missing runbook URLs, malformed severities, and missing owner teams. Combine it with a staging Prometheus that re-runs the rules against last week’s metrics to catch semantic regressions, not just syntax errors.
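The metadata checks are easy to run as a plain lint step next to promtool. The sketch below validates one parsed rule (a dictionary in the shape of a Prometheus alerting rule) for severity, owner team, and runbook URL. The allowed severity values and the team and runbook_url key names are assumptions; substitute the conventions of your own repo. Label selector typos are better caught by rule unit tests than by this kind of lint.

```python
from typing import Dict, List

# Assumed severity vocabulary; replace with the values your routing expects.
ALLOWED_SEVERITIES = {"page", "ticket", "info"}


def lint_rule(rule: Dict) -> List[str]:
    """Return a list of human-readable problems found in one alerting rule."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}

    severity = labels.get("severity")
    if severity not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: severity {severity!r} is not an allowed value")

    if not labels.get("team"):
        problems.append(f"{name}: missing owner team label")

    runbook = str(annotations.get("runbook_url") or "")
    if not runbook.startswith(("http://", "https://")):
        problems.append(f"{name}: missing or malformed runbook_url annotation")

    return problems
```

In CI, run lint_rule over every rule in the repo and fail the build when any list of problems is non-empty.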

Where to invest first

The investment ramp is concrete. First, add absent-data alerts to the top 20 critical metrics this week; this is a few hours of work and catches the most damaging silent failures. Second, add CI (linting plus dry-run) on alert config repos within a month. Third, add the alert canary as the safety net. Three layers then protect you: CI on config, absent-data on metrics, and the canary on the paging path.