Alert Dependency Fragility
Alerts depend on metric pipelines and paging integrations. When those dependencies break, the alerts go quiet.
Hidden alert dependencies
Every alert depends on at least 3 things: the metric pipeline, the evaluator, and the paging integration. Most also depend on a runbook URL, a dashboard, and a team mapping.
When any dependency breaks, the alert may go silent without a visible failure: Prometheus drops a scrape target, the metric stops, and the alert never fires.
Inventory the dependencies per alert. Make them explicit in the alert metadata.
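A minimal sketch of what explicit dependency metadata can look like in a Prometheus rule file. The `runbook_url`, `dashboard`, and `depends_on` annotation names and the `team` label are conventions assumed here, not anything Prometheus enforces; the metric, threshold, and URLs are illustrative.

```yaml
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="checkout",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
          team: payments          # the paging integration routes on this label
        annotations:
          summary: "Checkout 5xx rate above 5% for 10 minutes"
          runbook_url: "https://runbooks.example.com/checkout/error-rate"
          dashboard: "https://grafana.example.com/d/checkout-overview"
          # Explicit dependency list: the metric and integrations this alert
          # assumes are healthy, so tooling has something concrete to check.
          depends_on: "http_requests_total, pagerduty:payments"
```

Listing the dependencies does not make them reliable, but it gives the absent-data alerts and the link-check job below a concrete inventory to work from.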
Alert on absent data
Pair every metric-based alert with an absent-data alert. Prometheus has `absent()`; Datadog has the no-data state.
If the metric stops flowing for 15 minutes, fire a meta-alert to the owning team. This catches dropped scrape targets and broken exporters.
Without absent-data alerts, a broken pipeline silently invalidates everything downstream.
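A Prometheus sketch of the pairing, written against the checkout alert above; `absent()` returns 1 when no matching series exists, and the `for: 15m` clause implements the 15-minute window.

```yaml
groups:
  - name: meta-alerts
    rules:
      # Fires when the metric behind CheckoutErrorRateHigh stops flowing:
      # a dropped scrape target, a broken exporter, or a renamed metric
      # lands here instead of silently disabling the real alert.
      - alert: CheckoutRequestMetricAbsent
        expr: absent(http_requests_total{job="checkout"})
        for: 15m
        labels:
          severity: ticket      # goes to the owning team, not the on-call pager
          team: payments
        annotations:
          summary: "No http_requests_total samples from the checkout job for 15 minutes"
```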
Integration rot
PagerDuty integration keys rotate. Slack webhooks expire. SSO sessions expire and break read-only dashboard links.
Test integrations on a schedule. The alert canary covers the paging path; add health checks for dashboard links and runbook URLs.
Build a quarterly link-check job that files a ticket to the owning team for every broken runbook URL or dashboard link.
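One way to wire the schedule up, sketched as a GitHub Actions cron job. The `rules/` path, the annotation names, and the quarterly cron expression are assumptions about the alert repo's layout; turning a failure into a ticket is left to whatever tracker integration the team already has.

```yaml
name: quarterly-link-check
on:
  schedule:
    - cron: "0 6 1 */3 *"    # 06:00 UTC on the 1st of every third month
  workflow_dispatch: {}      # allow ad-hoc runs
jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check runbook and dashboard URLs in alert rules
        run: |
          set -euo pipefail
          failed=0
          # Extract the URLs from runbook_url and dashboard annotations
          # and fail the job if any of them no longer resolve.
          for url in $(grep -rhE '(runbook_url|dashboard):' rules/ \
                         | grep -oE 'https?://[^" ]+' | sort -u); do
            if ! curl -sfL -o /dev/null --max-time 10 "$url"; then
              echo "BROKEN: $url"
              failed=1
            fi
          done
          exit "$failed"
```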
Test alerts in CI
Alert configs in git are testable. Use Prometheus's `promtool test rules`, a `terraform plan` against the Datadog provider, or unit tests against PagerDuty's API.
CI catches typos in label selectors, missing runbook URLs, malformed severity levels, and missing owner teams.
Combine this with a staging Prometheus that replays the rule against last week's metrics; that catches semantic regressions, not just syntax errors.
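A minimal `promtool test rules` unit test for the checkout rule sketched earlier, assuming the test file sits next to `checkout.yml` in the rules directory; the input series are synthetic counters that push the 5xx ratio well above the 5% threshold.

```yaml
# checkout_test.yml — run in CI with: promtool test rules rules/checkout_test.yml
rule_files:
  - checkout.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="checkout", code="500"}'
        values: '0+60x30'      # 60 new 5xx requests per minute
      - series: 'http_requests_total{job="checkout", code="200"}'
        values: '0+100x30'     # 100 successful requests per minute
    alert_rule_test:
      - eval_time: 15m         # past the 10m `for` clause, so the alert should be firing
        alertname: CheckoutErrorRateHigh
        exp_alerts:
          - exp_labels:
              severity: page
              team: payments
            exp_annotations:
              summary: "Checkout 5xx rate above 5% for 10 minutes"
              runbook_url: "https://runbooks.example.com/checkout/error-rate"
              dashboard: "https://grafana.example.com/d/checkout-overview"
              depends_on: "http_requests_total, pagerduty:payments"
```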
Where to invest first
Add absent-data alerts to the top 20 critical metrics this week. That is a few hours of work and catches the most damaging silent failures.
Add CI to alert config repos within a month: linting plus a dry run.
Add the alert canary as the safety net. Three layers: CI on config, absent-data on metrics, canary on paging path.