Alert Source of Truth
Alerts defined in many places drift. Single source of truth.
The multi-tool problem
Most stacks have alerts in 3+ places: Datadog monitors, Prometheus rules, PagerDuty event rules, plus a few SaaS tools. Without a source of truth, alerts diverge: the same condition is alerted differently in each tool, or alerts exist in one tool nobody knows about. Pick one source of truth per layer and treat the others as derived.
- 3+ alerting tools typical. Datadog monitors, Prometheus rules, PagerDuty event rules, plus SaaS tools.
- Without source of truth, alerts diverge. Same condition alerted differently in each tool; ghost alerts in one tool nobody knows about.
- One source per layer. Pick one canonical source; treat the others as derived deployment targets.
- Per-stack source-of-truth choice. Documented per layer; supports investigation and consistency.
Git is the source of truth
Git is the right source of truth. Alert config in git, applied via Terraform, Pulumi, or a custom controller (Datadog provider, Prometheus rule files, PagerDuty Terraform); console edits are forbidden because CI runs terraform plan weekly to detect drift and the drift triggers a ticket to the offending team; branch-protected and code-reviewed like application code.
- Config in git. Alert config applied via Terraform, Pulumi, or custom controller; the canonical source.
- Console edits forbidden. CI runs
terraform planweekly; drift triggers a ticket to the offending team. - Branch-protected. Code-reviewed like application code; same workflow as the rest of engineering.
- Per-tool provider. Datadog provider, Prometheus rule files, PagerDuty Terraform; the deployment-time mapping.
The alert inventory
The inventory is generated, not maintained. Single inventory with alert name, owner team, severity, source tool, runbook URL, dashboard URL; generate from the git repo (canonical) rather than each tool’s API (deployment target); publish weekly so anyone can search for “who owns alert X” without asking around.
- Single-table inventory. Alert name, owner team, severity, source tool, runbook URL, dashboard URL.
- Generate from git. Git repo is canonical; API is the deployment target; the inventory follows the source.
- Publish weekly. Anyone can search for “who owns alert X” without asking around.
- Per-team ownership view. The inventory exposes per-team alert counts; supports the noise-attribution work.
Migrating off console-edited alerts
The migration is staged. Phase 1: import existing alerts via terraform import or equivalent (one-time bulk migration). Phase 2: lock down console write access (read-only for everyone except a break-glass account). Phase 3: monthly drift detection (drift means someone broke the rule, investigate).
- Phase 1: import.
terraform importor equivalent; one-time bulk migration. - Phase 2: lock down write. Console read-only for everyone except a break-glass account.
- Phase 3: drift detection. Monthly drift check; drift means someone broke the rule; investigate.
- Per-phase exit criterion. Each phase has documented criteria for moving on; supports steady migration.
When to pick the IaC approach
The decision is scale-driven. Above 50 alerts or 3+ teams, the IaC approach pays back within a quarter; below that, a documented spreadsheet inventory is enough; don’t try to manage alerts in 5 tools by hand because the drift will eat you.
- Above 50 alerts or 3+ teams. IaC approach pays back within a quarter; the threshold for investment.
- Below: spreadsheet enough. Documented spreadsheet inventory; the lower-cost option.
- Don’t hand-manage 5 tools. The drift will eat you; the manual ceiling is real.
- Per-org decision recorded. The IaC-vs-manual choice committed; supports later revisiting.