Alert Source of Truth

Alerts defined in many places drift. Single source of truth.

The multi-tool problem

Most stacks have alerts in 3+ places: Datadog monitors, Prometheus rules, PagerDuty event rules, plus a few SaaS tools. Without a source of truth, alerts diverge: the same condition is alerted differently in each tool, or alerts exist in one tool nobody knows about. Pick one source of truth per layer and treat the others as derived.

3+ alerting tools typical. Datadog monitors, Prometheus rules, PagerDuty event rules, plus SaaS tools.
Without source of truth, alerts diverge. Same condition alerted differently in each tool; ghost alerts in one tool nobody knows about.
One source per layer. Pick one canonical source; treat the others as derived deployment targets.
Per-stack source-of-truth choice. Documented per layer; supports investigation and consistency.

Git is the source of truth

Git is the right source of truth. Alert config in git, applied via Terraform, Pulumi, or a custom controller (Datadog provider, Prometheus rule files, PagerDuty Terraform); console edits are forbidden because CI runs terraform plan weekly to detect drift and the drift triggers a ticket to the offending team; branch-protected and code-reviewed like application code.

Config in git. Alert config applied via Terraform, Pulumi, or custom controller; the canonical source.
Console edits forbidden. CI runs terraform plan weekly; drift triggers a ticket to the offending team.
Branch-protected. Code-reviewed like application code; same workflow as the rest of engineering.
Per-tool provider. Datadog provider, Prometheus rule files, PagerDuty Terraform; the deployment-time mapping.

The alert inventory

The inventory is generated, not maintained. Single inventory with alert name, owner team, severity, source tool, runbook URL, dashboard URL; generate from the git repo (canonical) rather than each tool’s API (deployment target); publish weekly so anyone can search for “who owns alert X” without asking around.

Single-table inventory. Alert name, owner team, severity, source tool, runbook URL, dashboard URL.
Generate from git. Git repo is canonical; API is the deployment target; the inventory follows the source.
Publish weekly. Anyone can search for “who owns alert X” without asking around.
Per-team ownership view. The inventory exposes per-team alert counts; supports the noise-attribution work.

Migrating off console-edited alerts

The migration is staged. Phase 1: import existing alerts via terraform import or equivalent (one-time bulk migration). Phase 2: lock down console write access (read-only for everyone except a break-glass account). Phase 3: monthly drift detection (drift means someone broke the rule, investigate).

Phase 1: import. terraform import or equivalent; one-time bulk migration.
Phase 2: lock down write. Console read-only for everyone except a break-glass account.
Phase 3: drift detection. Monthly drift check; drift means someone broke the rule; investigate.
Per-phase exit criterion. Each phase has documented criteria for moving on; supports steady migration.

When to pick the IaC approach

The decision is scale-driven. Above 50 alerts or 3+ teams, the IaC approach pays back within a quarter; below that, a documented spreadsheet inventory is enough; don’t try to manage alerts in 5 tools by hand because the drift will eat you.

Above 50 alerts or 3+ teams. IaC approach pays back within a quarter; the threshold for investment.
Below: spreadsheet enough. Documented spreadsheet inventory; the lower-cost option.
Don’t hand-manage 5 tools. The drift will eat you; the manual ceiling is real.
Per-org decision recorded. The IaC-vs-manual choice committed; supports later revisiting.