Alert Dependency Fragility
Alerts depend on metrics and integrations. When those dependencies break, alerts go quiet.
Hidden alert dependencies
Every alert depends on at least three things: the metric pipeline, the evaluator, and the paging integration. Most also depend on a runbook URL, a dashboard, and a team mapping. When any dependency breaks, the alert may go silent without a visible failure: Prometheus drops a target, the metric stops, and the alert never fires. Inventory the dependencies per alert and make them explicit in the alert metadata.
- 3+ dependencies per alert. Metric pipeline, evaluator, paging integration; plus runbook, dashboard, team.
- Silent failure mode. A dropped target produces a stopped metric, which produces a silent alert.
- Inventory dependencies. Make them explicit in alert metadata; supports investigation.
- Per-alert dependency graph. The chain documented; supports monitoring of the monitoring.
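The inventory described above can be sketched as a small check over alert metadata. This is a minimal illustration, not a prescribed schema: the AlertMetadata class, the dependency key names, and the example alert are all hypothetical.

```python
from dataclasses import dataclass, field

# Dependency kinds named in the text; the key names are illustrative assumptions.
REQUIRED = ("metric_pipeline", "evaluator", "paging_integration")
OPTIONAL = ("runbook_url", "dashboard_url", "owner_team")


@dataclass
class AlertMetadata:
    """Hypothetical per-alert record that makes dependencies explicit."""
    name: str
    dependencies: dict = field(default_factory=dict)


def missing_dependencies(alert: AlertMetadata) -> dict:
    """Return the dependency kinds the alert's metadata does not declare."""
    declared = {kind for kind, value in alert.dependencies.items() if value}
    return {
        "required": [kind for kind in REQUIRED if kind not in declared],
        "optional": [kind for kind in OPTIONAL if kind not in declared],
    }


# Example alert with a partially filled inventory (all values are made up).
alert = AlertMetadata(
    name="HighErrorRate",
    dependencies={
        "metric_pipeline": "prometheus/job=api",
        "evaluator": "prometheus-rules",
        "paging_integration": "pagerduty/api-oncall",
        "runbook_url": "https://example.com/runbooks/high-error-rate",
    },
)
print(missing_dependencies(alert))
```

Running the check across every alert gives a list of gaps to fill before the dependency chain can itself be monitored.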
Alert on absent data
Absent-data alerts are the safety net. Pair every metric-based alert with an absent-data alert (Prometheus has absent(), Datadog has the no-data state). If the metric stops flowing for 15 minutes, fire a meta-alert to the owning team; this catches dropped scrape targets and broken exporters. Without absent-data alerts, a broken pipeline silently invalidates everything downstream.
- Pair every metric alert. With absent-data; absent() in Prometheus, no-data in Datadog.
- 15-minute meta-alert. Metric stops flowing 15 minutes; meta-alert to owning team.
- Catches dropped targets. Broken exporters surface; the silent failure becomes loud.
- Per-pipeline safety. Without absent-data, a broken pipeline silently invalidates downstream.
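One way to keep the pairing consistent is to generate the absent-data rule from the metric name. The sketch below builds a dictionary in the shape of a Prometheus alerting rule (alert, expr, for, labels, annotations). The function name, the alert naming scheme, the "warning" severity, and the "team" label are assumptions for illustration.

```python
def absent_data_rule(metric: str, owner_team: str, minutes: int = 15) -> dict:
    """Build a Prometheus-style alerting rule that fires when `metric` has no series.

    `metric` is assumed to be a bare metric name without label selectors.
    """
    return {
        "alert": f"{metric}_absent",          # naming scheme is an assumption
        "expr": f"absent({metric})",          # absent() returns 1 when no series match
        "for": f"{minutes}m",                 # metric must be missing this long
        "labels": {"severity": "warning", "team": owner_team},
        "annotations": {
            "summary": f"No data for {metric} in the last {minutes} minutes",
        },
    }


# Generate the meta-alert for one hypothetical metric and team.
rule = absent_data_rule("http_requests_total", owner_team="payments")
print(rule["expr"], rule["for"])
```

The resulting dictionaries can be serialized into a rule file under a groups entry alongside the metric-based alerts they protect.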
Integration rot
Integrations rot. PagerDuty integration keys rotate, Slack webhooks expire, and SSO logins expire and break read-only dashboard links. Test integrations on a schedule: the alert canary covers the paging path, and health checks cover dashboard links and runbook URLs. Build a quarterly link-check job that files tickets for broken URLs.
- Key and webhook rotation. PagerDuty keys rotate; Slack webhooks expire.
- SSO expiration. Read-only dashboard links break silently.
- Alert canary covers paging path. Synthetic alert verifies the chain end-to-end.
- Quarterly link-check job. Broken runbook and dashboard URLs file tickets.
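The link-check job can be sketched with the Python standard library. The function names and the input shape (alert name mapped to a list of URLs) are hypothetical, and ticket filing is left out because it depends on the tracker in use. The fetcher is passed in as a parameter so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request fails."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code                      # 4xx/5xx responses arrive as exceptions
    except (OSError, ValueError):
        return 0                             # DNS failure, timeout, malformed URL


def broken_links(alert_links: dict, fetch: Callable[[str], int] = http_status) -> list:
    """Return (alert, url, status) for every link that fails or returns >= 400."""
    broken = []
    for alert, urls in alert_links.items():
        for url in urls:
            status = fetch(url)
            if status == 0 or status >= 400:
                broken.append((alert, url, status))
    return broken


# Usage with a stubbed fetcher; the URLs and statuses are made up.
stub = {"https://example.com/runbook": 200, "https://example.com/dash": 404}
links = {"HighErrorRate": list(stub)}
print(broken_links(links, fetch=lambda url: stub.get(url, 0)))
```

A caveat on this sketch: a dashboard link that redirects to an SSO login page would still return a successful status here, so detecting expired SSO access would need an additional check on the final URL or page content.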
Test alerts in CI
Alert configs in git are testable. Use Prometheus promtool test rules, Datadog terraform plan, or unit tests against PagerDuty’s API. CI catches typos in label selectors, a missing runbook URL, a malformed severity, or a missing owner team. Combine this with a staging Prometheus that re-runs the rule against last week’s metrics to catch semantic regressions, not just syntax.
- promtool test rules. Prometheus rule unit tests; the canonical CI mechanism.
- terraform plan and PagerDuty API tests. Datadog and PagerDuty config validated.
- CI catches typos and missing fields. Label selectors, runbook URL, severity, owner team.
- Staging Prometheus replay. Last-week metrics re-evaluated; catches semantic regressions.
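The missing-field checks can run as a small lint step in CI, alongside promtool. The sketch below covers only the metadata fields (severity, owner team, runbook URL); it does not evaluate expressions or catch label selector typos. The severity vocabulary, the "team" label, and the "runbook_url" annotation name are assumptions about the repository's conventions.

```python
import re

ALLOWED_SEVERITIES = {"critical", "warning", "info"}  # assumed vocabulary


def lint_rule(rule: dict) -> list:
    """Return human-readable problems found in one alerting rule dictionary."""
    problems = []
    name = rule.get("alert", "<unnamed>")
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}

    if labels.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: malformed or missing severity")
    if not labels.get("team"):
        problems.append(f"{name}: missing owner team")
    runbook = str(annotations.get("runbook_url") or "")
    if not re.match(r"https?://\S+$", runbook):
        problems.append(f"{name}: missing or invalid runbook URL")
    return problems


# One well-formed rule and one with every field wrong (values are made up).
good = {
    "alert": "HighErrorRate",
    "labels": {"severity": "critical", "team": "payments"},
    "annotations": {"runbook_url": "https://example.com/runbooks/high-error-rate"},
}
bad = {"alert": "DiskFull", "labels": {"severity": "sev-1"}}
for rule in (good, bad):
    print(rule["alert"], lint_rule(rule))
```

A CI job would load the rule files, call the lint on each rule, and fail the build if any problems are returned.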
Where to invest first
The investment ramp is concrete. Add absent-data alerts to the top 20 critical metrics this week; this is a few hours of work and catches the most damaging silent failures. Add CI on alert config repos within a month (linting plus dry-run). Then add the alert canary as the safety net, so that three layers protect the system: CI on config, absent-data on metrics, and the canary on the paging path.
- Top 20 absent-data this week. Few hours; catches damaging silent failures.
- CI on config within a month. Linting plus dry-run; the second layer.
- Alert canary as safety net. Three layers total: CI, absent-data, canary.
- Per-layer measured. Each layer’s catch rate tracked; supports continued investment.