Alert Routing by Data
Alert routing driven by the labels in the alert payload, not by fixed receivers.
Routing by alert content
Data-driven routing inspects the alert payload and routes by labels: env, region, service, severity, team. Alertmanager’s tree of routes plus matchers covers most cases; Datadog’s monitor tags do the same. The reason to bother: one rule definition can serve many teams without manual receiver duplication.
- Static vs data-driven. Static sends to fixed receivers; data-driven inspects the alert payload and routes by labels.
- Routing surface. env, region, service, severity, team; the labels carry the routing intent.
- Tooling. Alertmanager tree plus matchers; Datadog monitor tags; both cover most cases.
- Reuse benefit. One rule definition serves many teams without manual receiver duplication.
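The difference between the two styles can be sketched in a few lines of Python. This is a minimal illustration, not how Alertmanager or Datadog implement routing; the receiver names and the "team-severity" naming scheme are assumptions made up for the example.

```python
from typing import Mapping

# Hypothetical fixed receiver, for illustration only.
STATIC_RECEIVER = "ops-email"


def route_static(alert_labels: Mapping[str, str]) -> str:
    """Static routing: every alert goes to the same fixed receiver."""
    return STATIC_RECEIVER


def route_by_labels(alert_labels: Mapping[str, str]) -> str:
    """Data-driven routing: derive the receiver from the alert's own labels.

    The "<team>-<severity>" receiver naming is an assumed convention.
    """
    team = alert_labels.get("team", "unowned")
    severity = alert_labels.get("severity", "sev3")
    return f"{team}-{severity}"


alert = {"service": "checkout", "team": "payments", "severity": "sev1", "env": "production"}
print(route_static(alert))
print(route_by_labels(alert))
```

With the data-driven version, adding a new team means attaching a team label to its alerts rather than writing another receiver block by hand.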
Labels to route on
Pick the labels with care. Required: severity, service, env. Optional: region, cluster, team, customer_tier. Avoid volatile labels like instance or pod_name because they create unmanageable receiver explosions; validate label names and values because a mismatch such as env=staging vs environment=staging silently misroutes pages.
- Required labels. severity, service, env; the minimum routing surface.
- Optional labels. region, cluster, team, customer_tier; supports more granular routing where needed.
- Avoid volatile labels. instance and pod_name create unmanageable receiver explosions.
- Validate label names and values. A mismatch like env=staging vs environment=staging silently misroutes pages.
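A label check of this kind could run in CI before a rule or route is merged. The sketch below is one possible shape, assuming rule labels arrive as a plain dict and that the allowed values for severity and env are the ones listed; those value sets and both function names are hypothetical.

```python
REQUIRED = {"severity", "service", "env"}
ROUTABLE = REQUIRED | {"region", "cluster", "team", "customer_tier"}
VOLATILE = {"instance", "pod_name"}

# Assumed value sets; replace with whatever the organisation actually uses.
ALLOWED_VALUES = {
    "severity": {"sev1", "sev2", "sev3"},
    "env": {"production", "staging", "dev"},
}


def check_rule_labels(labels: dict) -> list:
    """Return problems with the labels attached to one alert rule."""
    problems = [f"missing required label: {name}" for name in sorted(REQUIRED - labels.keys())]
    for name, value in sorted(labels.items()):
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            problems.append(f"unexpected value for {name}: {value!r}")
    return problems


def check_matcher_labels(names: set) -> list:
    """Return problems with the label names a route matches on."""
    problems = []
    for name in sorted(names):
        if name in VOLATILE:
            problems.append(f"volatile label in matcher: {name}")
        elif name not in ROUTABLE:
            problems.append(f"unknown label in matcher: {name}")
    return problems


# The env vs environment mix-up surfaces as a missing required label.
print(check_rule_labels({"severity": "sev1", "service": "checkout", "environment": "staging"}))
print(check_matcher_labels({"team", "pod_name", "environment"}))
```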
Routing tree pattern
The routing tree usually splits in three layers. Top splits on severity (sev1 to PagerDuty, sev2 to Slack, sev3 to email queue); middle splits on team (each team has its own service or channel); leaf splits on env (production to on-call, non-prod to a daytime channel).
- Top: severity. Sev1 to PagerDuty, sev2 to Slack channel, sev3 to email queue; the urgency split.
- Middle: team. Each team has its own PagerDuty service or Slack channel; the ownership split.
- Leaf: env. Production routes to on-call, non-prod routes to a daytime channel; the environment split.
- Per-tree depth limit. Three levels max; beyond that the tree becomes unmaintainable.
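The three-layer split can be modelled as a small tree walk. The sketch below loosely follows Alertmanager's first-matching-child behaviour (without the continue option): an alert descends into the first child whose matchers fit, and falls back to the current node's receiver when no child matches. The Route class, the resolve function, and all receiver names are illustrative assumptions, not a real API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Route:
    match: dict                      # label name -> required value; {} matches everything
    receiver: str                    # used when no child route matches
    children: list = field(default_factory=list)


def resolve(route: Route, labels: dict) -> Optional[str]:
    """Return the receiver of the deepest matching route, or None if this route does not match."""
    if any(labels.get(name) != value for name, value in route.match.items()):
        return None
    for child in route.children:
        found = resolve(child, labels)
        if found is not None:
            return found             # first matching child wins
    return route.receiver


# Top: severity. Middle: team. Leaf: env.
tree = Route({}, "email-queue", [
    Route({"severity": "sev1"}, "pagerduty-default", [
        Route({"team": "payments"}, "slack-payments-daytime", [
            Route({"env": "production"}, "pagerduty-payments-oncall"),
        ]),
    ]),
    Route({"severity": "sev2"}, "slack-alerts"),
    Route({"severity": "sev3"}, "email-queue"),
])

print(resolve(tree, {"severity": "sev1", "team": "payments", "env": "production"}))
print(resolve(tree, {"severity": "sev1", "team": "payments", "env": "staging"}))
```

Putting the non-prod fallback on the team node itself, rather than adding a second leaf, keeps the tree at three levels below the root.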
Managing routing changes
Routing config is code. Changes go through a pull request, review, and CI checks such as amtool check-config for Alertmanager or terraform validate for Datadog monitors managed in Terraform. Keep a runbook for routing changes during incidents, because mid-incident edits are a known cause of missed escalations. Snapshot the live config nightly to S3 so it can be diffed against the last known good version.
- Routing config as code. Pull request, review, CI checks with amtool check-config (Alertmanager) or terraform validate (Datadog monitors managed in Terraform).
- Mid-incident edits are dangerous. Known cause of missed escalations; runbook required for routing changes during incidents.
- Nightly config snapshot. Live config snapshotted to S3; diff against last known good when something breaks.
- Per-change review. Routing changes are reviewed like code, so a bad matcher is caught before it drops a page.
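Comparing the live config with the last known good snapshot needs nothing more than a text diff. The sketch below uses Python's standard difflib and assumes both configs have already been fetched as strings; the S3 download step is left out, and the function name and sample YAML are hypothetical.

```python
import difflib


def config_diff(last_known_good: str, live: str) -> str:
    """Return a unified diff between two config snapshots, or '' if they are identical."""
    lines = difflib.unified_diff(
        last_known_good.splitlines(keepends=True),
        live.splitlines(keepends=True),
        fromfile="last-known-good",
        tofile="live",
    )
    return "".join(lines)


# Made-up snapshot contents for illustration.
good = "route:\n  receiver: email-queue\n  routes:\n    - receiver: pagerduty-payments-oncall\n"
live = "route:\n  receiver: email-queue\n  routes:\n    - receiver: slack-payments-daytime\n"

print(config_diff(good, live) or "no changes")
```

An empty result means the live config still matches the snapshot; anything else shows exactly which routes moved.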
Keep the tree shallow
Three rules keep routing maintainable. Keep the tree to three levels at most, because beyond that it becomes unmaintainable. Skip data-driven routing if the team owns fewer than 30 alerts, because static config is fine at that size. Don't route on customer-specific labels unless contractually required, because per-customer pagers are the path to madness.
- Three levels max. Beyond that the tree becomes unmaintainable.
- Static OK under 30 alerts. Data-driven routing pays back at scale; with fewer than 30 alerts, static config is fine.
- Avoid customer-specific labels. Per-customer pagers are the path to madness; use them only when contractually required.
- Annual routing review. Review routing depth and label use once a year and remove levels and labels that no longer earn their keep.
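The depth limit is easy to enforce mechanically. The sketch below assumes the routing tree has been loaded into nested dicts where child routes live under a "routes" key, as in an Alertmanager config file; the function names and the sample tree are illustrative.

```python
MAX_LEVELS = 3  # the limit suggested above


def levels_below(route: dict) -> int:
    """Count the nested split levels under a route (0 when it has no children)."""
    children = route.get("routes") or []
    if not children:
        return 0
    return 1 + max(levels_below(child) for child in children)


def check_depth(root: dict, max_levels: int = MAX_LEVELS) -> bool:
    """Return True when the tree stays within the allowed number of levels."""
    return levels_below(root) <= max_levels


# Hypothetical tree: severity -> team -> env below the root.
tree = {
    "receiver": "email-queue",
    "routes": [
        {"receiver": "pagerduty-default", "routes": [
            {"receiver": "slack-payments-daytime", "routes": [
                {"receiver": "pagerduty-payments-oncall"},
            ]},
        ]},
        {"receiver": "slack-alerts"},
    ],
}

print(levels_below(tree), check_depth(tree))
```

Run as a CI step or during the annual review, a check like this turns the three-level rule from a convention into a failing build.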