Alert Dependency Graph
Alerts depend on metrics, services, and integrations. Map the graph.
Why map dependencies
Alerts depend on metrics, metrics depend on exporters, exporters depend on services, and services depend on integrations. Break any link and the alert silently stops firing. Without a graph, you first discover that the node-exporter pod has been gone for two months during an outage, when the host alert never fires. Treat the alert catalog as a dependency graph, not a flat list.
- Multi-layer dependency. Alerts depend on metrics, exporters, services, integrations; a chain.
- Silent breakage. Break any link and the alert stops; you find out during an outage.
- Discovery during outage. Two-month-gone exporter discovered when the host alert never fires.
- Graph not list. Treat the alert catalog as a dependency graph; the relationships are what matter.
What's in the graph
The graph has well-defined nodes and edges. Nodes: alerts, recording rules, metric names, exporters, services, datasources. Edges: alert depends_on rule, rule depends_on metric, metric exposed_by exporter, exporter runs_on service, service ingested_by datasource. Store the graph in Neo4j or in flat YAML committed to the repo; the discipline of writing it down is what matters.
- Node types. Alerts, recording rules, metric names, exporters, services, datasources.
- Edge types. depends_on, exposed_by, runs_on, ingested_by; the relationship vocabulary.
- Storage choice. Neo4j or flat YAML committed to the repo; either works.
- Per-node owner. Each node has an owner team, so there is someone to ask when a node breaks.
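The node and edge vocabulary above can be written down in a few lines. The following is a minimal sketch, not a prescribed schema: the class names `Node` and `Edge`, the `validate` helper, and every example id (`HighErrorRate`, `api-exporter`, `prometheus-main`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Vocabularies taken from the text: six node types, four edge types.
NODE_TYPES = {"alert", "recording_rule", "metric", "exporter", "service", "datasource"}
EDGE_TYPES = {"depends_on", "exposed_by", "runs_on", "ingested_by"}


@dataclass(frozen=True)
class Node:
    id: str
    type: str
    owner: str  # owning team for this node

    def __post_init__(self):
        if self.type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.type}")


@dataclass(frozen=True)
class Edge:
    source: str
    relation: str
    target: str

    def __post_init__(self):
        if self.relation not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.relation}")


def validate(nodes, edges):
    """Return ids referenced by an edge but never declared as a node."""
    known = {node.id for node in nodes}
    missing = []
    for edge in edges:
        for endpoint in (edge.source, edge.target):
            if endpoint not in known and endpoint not in missing:
                missing.append(endpoint)
    return missing


# Hypothetical chain: one alert traced down to its datasource.
NODES = [
    Node("HighErrorRate", "alert", "payments"),
    Node("job:http_errors:rate5m", "recording_rule", "payments"),
    Node("http_requests_total", "metric", "payments"),
    Node("api-exporter", "exporter", "platform"),
    Node("api", "service", "payments"),
    Node("prometheus-main", "datasource", "observability"),
]
EDGES = [
    Edge("HighErrorRate", "depends_on", "job:http_errors:rate5m"),
    Edge("job:http_errors:rate5m", "depends_on", "http_requests_total"),
    Edge("http_requests_total", "exposed_by", "api-exporter"),
    Edge("api-exporter", "runs_on", "api"),
    Edge("api", "ingested_by", "prometheus-main"),
]
```

Rejecting unknown node and edge types at construction time keeps the vocabulary small, and `validate(NODES, EDGES)` returning an empty list is a cheap check to run in CI whenever the file changes.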
What the graph unlocks
The graph unlocks three concrete benefits. Blast radius: if renaming http_requests_total breaks 14 alerts, the graph lists all of them before the change ships. Health checking: a synthetic monitor walks the dependency chain and catches broken exporters before they break alerts. Onboarding: new engineers see relationships rather than isolated rules.
- Blast radius for rename. Renaming http_requests_total breaks 14 alerts; the graph lists them.
- Synthetic chain check. Walks the graph; catches broken exporters before they break alerts.
- Onboarding. New engineers see relationships, not isolated rules.
- Per-incident dependency view. During an investigation, the graph shows the chain from alert to exporter, which speeds up triage.
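Both the blast-radius query and the per-incident chain view are graph traversals. A minimal sketch follows, assuming edges are stored as (source, relation, target) triples; the function names `blast_radius` and `chain` and all alert, metric, and exporter names are hypothetical.

```python
from collections import defaultdict, deque

# (source, relation, target) triples; every name here is made up.
EDGES = [
    ("HighErrorRate", "depends_on", "job:http_errors:rate5m"),
    ("job:http_errors:rate5m", "depends_on", "http_requests_total"),
    ("HighLatency", "depends_on", "http_request_duration_seconds_bucket"),
    ("TrafficDrop", "depends_on", "http_requests_total"),
    ("http_requests_total", "exposed_by", "api-exporter"),
    ("http_request_duration_seconds_bucket", "exposed_by", "api-exporter"),
    ("api-exporter", "runs_on", "api"),
]
ALERTS = {"HighErrorRate", "HighLatency", "TrafficDrop"}


def blast_radius(node, edges, alerts):
    """Return the alerts that transitively depend on `node`, sorted by name."""
    dependents = defaultdict(set)  # target -> sources pointing at it
    for source, _relation, target in edges:
        dependents[target].add(source)
    seen, queue = {node}, deque([node])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen & alerts)


def chain(alert, edges):
    """Return everything `alert` depends on, in breadth-first order."""
    dependencies = defaultdict(list)  # source -> targets it relies on
    for source, _relation, target in edges:
        dependencies[source].append(target)
    order, seen, queue = [], {alert}, deque([alert])
    while queue:
        current = queue.popleft()
        for child in dependencies[current]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(blast_radius("http_requests_total", EDGES, ALERTS))
# ['HighErrorRate', 'TrafficDrop']
print(chain("HighErrorRate", EDGES))
# ['job:http_errors:rate5m', 'http_requests_total', 'api-exporter', 'api']
```

The reverse walk passes through recording rules, so an alert that only references `job:http_errors:rate5m` still shows up when the underlying metric is renamed. A synthetic chain check could reuse `chain` and probe each returned node in turn.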
Automating the graph
Automation builds the graph cheaply. Validate rule files with promtool and parse each rule expression with a PromQL parser to extract metric references and build metric-to-rule edges; read exporter health from scrape success (the up series) and build metric-to-exporter edges from scrape labels such as job and instance; pull service ownership from a CMDB or Backstage rather than reinventing service catalog data.
- promtool and a PromQL parser for metric refs. Validate and parse Prometheus rules; build metric-to-rule edges automatically.
- Scrape labels for exporters. Read scrape success from the up series; build metric-to-exporter edges from scrape labels.
- CMDB or Backstage for ownership. Service ownership pulled; don’t reinvent service catalog data.
- Per-source automation. Each edge type is rebuilt from its own source of truth, which keeps the graph fresh.
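The scrape-label step can be sketched as a pair of pure functions. This is an illustration under stated assumptions: the input is a list of label sets shaped like the ones the Prometheus HTTP API returns from `/api/v1/series` (each with `__name__`, `job`, and `instance`), plus `up` samples as (labels, value) pairs. The code does not call the API itself; treating the `job` label as the exporter identity is an assumption, and the function names and sample data are hypothetical.

```python
def exporter_edges(series):
    """Build (metric, 'exposed_by', job) edges from series label sets."""
    edges = set()
    for labels in series:
        name = labels.get("__name__")
        job = labels.get("job")
        # Skip `up`: Prometheus generates it per scrape, no exporter exposes it.
        if name and job and name != "up":
            edges.add((name, "exposed_by", job))
    return sorted(edges)


def failing_jobs(up_samples):
    """Jobs with at least one target whose last scrape failed (up == 0)."""
    return sorted({labels["job"] for labels, value in up_samples if value == 0})


def missing_jobs(expected_jobs, up_samples):
    """Jobs the graph expects that have no `up` series at all."""
    seen = {labels["job"] for labels, _value in up_samples}
    return sorted(set(expected_jobs) - seen)


# Hypothetical sample data in the assumed shape.
SERIES = [
    {"__name__": "node_cpu_seconds_total", "job": "node-exporter", "instance": "host-a:9100"},
    {"__name__": "node_cpu_seconds_total", "job": "node-exporter", "instance": "host-b:9100"},
    {"__name__": "http_requests_total", "job": "api", "instance": "api-1:8080"},
    {"__name__": "up", "job": "api", "instance": "api-1:8080"},
]
UP_SAMPLES = [
    ({"job": "node-exporter", "instance": "host-a:9100"}, 1),
    ({"job": "node-exporter", "instance": "host-b:9100"}, 0),
    ({"job": "api", "instance": "api-1:8080"}, 1),
]

print(exporter_edges(SERIES))
# [('http_requests_total', 'exposed_by', 'api'),
#  ('node_cpu_seconds_total', 'exposed_by', 'node-exporter')]
print(failing_jobs(UP_SAMPLES))  # ['node-exporter']
print(missing_jobs(["api", "node-exporter", "blackbox"], UP_SAMPLES))  # ['blackbox']
```

The `missing_jobs` check matters because a target that has dropped out of service discovery has no `up` series at all, rather than one with value 0, so comparing expected jobs against observed ones catches the case where an exporter has quietly disappeared.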
Worth it above 100 rules
The investment threshold is rule count. Below 100 rules, the catalog is small enough to keep in your head and a dependency graph is overkill. Above 100, the graph pays for itself the first time a renamed metric silently breaks pages. Skip the visual UI tooling; a queryable JSON file is enough.
- Below 100: skip. Catalog small enough to keep in your head; graph is overkill.
- Above 100: pays back. First renamed-metric incident pays for the graph; the threshold is real.
- Queryable JSON enough. Skip the visual UI tooling; the JSON is the operational interface.
- Scales with rule count. The value of the graph grows as rules are added; the cost-benefit shifts predictably.
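To show how little tooling a queryable JSON file needs, here is a minimal sketch that loads a graph and answers two questions with list comprehensions. The JSON layout (`nodes` and `edges` arrays), the helper names, and all ids are assumptions made for this example, not a required format.

```python
import json

# Hypothetical graph file contents; in practice this would be read from disk.
GRAPH_JSON = """
{
  "nodes": [
    {"id": "HighLoad", "type": "alert", "owner": "sre"},
    {"id": "node_load1", "type": "metric", "owner": "sre"},
    {"id": "DiskFull", "type": "alert", "owner": "sre"},
    {"id": "node_filesystem_avail_bytes", "type": "metric", "owner": "sre"},
    {"id": "node-exporter", "type": "exporter", "owner": "platform"}
  ],
  "edges": [
    {"source": "HighLoad", "relation": "depends_on", "target": "node_load1"},
    {"source": "DiskFull", "relation": "depends_on", "target": "node_filesystem_avail_bytes"},
    {"source": "node_load1", "relation": "exposed_by", "target": "node-exporter"}
  ]
}
"""


def direct_dependents(graph, node_id):
    """Ids of nodes with an edge pointing directly at `node_id`."""
    return sorted(e["source"] for e in graph["edges"] if e["target"] == node_id)


def metrics_without_exporter(graph):
    """Metric nodes that have no exposed_by edge, i.e. a gap in the chain."""
    exposed = {e["source"] for e in graph["edges"] if e["relation"] == "exposed_by"}
    return sorted(
        n["id"] for n in graph["nodes"]
        if n["type"] == "metric" and n["id"] not in exposed
    )


graph = json.loads(GRAPH_JSON)
print(direct_dependents(graph, "node_load1"))  # ['HighLoad']
print(metrics_without_exporter(graph))         # ['node_filesystem_avail_bytes']
```

Queries like these fit in a script or a CI step, which is the sense in which the JSON file can serve as the operational interface without a visual UI on top.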