The Monitoring-as-Code Migration
Most teams have monitors that were clicked together in a UI. This piece covers the migration to code-defined monitors, the order of operations, and the team behaviour it changes.
Phase 1: export
Monitoring-as-code migration is the multi-month project of moving alert rules, monitors, and detection patterns from UI-edited state to declarative configuration. The migration touches every team's monitors and produces version-controlled monitoring with the same reviewability and reproducibility as application code. The work requires sustained discipline, and the benefits compound.
What phase 1 looks like:
- Export existing UI monitors to declarative config: Each monitor in the UI is exported to the team's chosen format, such as Terraform or vendor-specific YAML. The export captures the exact state of the monitor, so the configuration is now portable.
- Use the format the platform supports: The format depends on the platform, for example the Datadog Terraform provider, Grafana provisioning files, the PagerDuty Terraform provider, or Opsgenie API config. Each platform has its own supported declarative format.
- Commit to git: The exported monitors are committed to git. They live alongside application code or in a dedicated monitoring repository. Version control is the foundation of everything else.
- Existing monitors are now versioned: Every change to a monitor produces a commit. The history shows who changed what and when. A revert is a git revert, investigations have data to work from, and compliance reviews have evidence.
- No new state changes outside code: Once exported, the team commits to making future changes through the code. Tooling enforces the commitment: UI changes that drift from code are detected automatically and either reconciled into code or reverted.
Phase 1 is the bulk of the migration work. The export and reconciliation effort is significant; the result is the foundation for everything that follows.
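The export step can be sketched in a few lines. This is a minimal illustration, not any vendor's export tool: it assumes the monitors have already been fetched from the platform as plain dictionaries, and the field names in VOLATILE_FIELDS and the one-JSON-file-per-monitor layout are assumptions chosen for the example. The point it shows is that server-managed fields are stripped and keys are sorted, so a later re-export of an unchanged monitor produces no diff in git.

```python
import json
import re
from pathlib import Path

# Assumed server-managed fields; they change without anyone editing the
# monitor, so keeping them would create noisy diffs. Adjust per platform.
VOLATILE_FIELDS = {"id", "created_at", "modified_at", "creator"}


def slugify(name: str) -> str:
    """Turn a monitor name into a stable, filesystem-safe file stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "unnamed-monitor"


def normalize(monitor: dict) -> dict:
    """Drop volatile fields so the file only changes when the monitor does."""
    return {k: v for k, v in monitor.items() if k not in VOLATILE_FIELDS}


def export_monitors(monitors: list[dict], out_dir: Path) -> list[Path]:
    """Write one deterministic JSON file per monitor and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    seen: set[str] = set()
    for monitor in monitors:
        stem = slugify(monitor["name"])
        if stem in seen:
            # Two monitors mapping to one file would silently lose one.
            raise ValueError(f"duplicate monitor file name: {stem}")
        seen.add(stem)
        path = out_dir / f"{stem}.json"
        # sort_keys and a fixed indent keep the output byte-stable.
        text = json.dumps(normalize(monitor), indent=2, sort_keys=True)
        path.write_text(text + "\n")
        written.append(path)
    return written
```

Running export_monitors over the full monitor list and committing the output directory gives the initial versioned snapshot. Teams that standardise on Terraform would generate resource blocks instead of JSON files, but the normalisation concern is the same.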
Phase 2: freeze UI
Once the monitors are in code, the UI must be frozen. New changes made through the UI undermine the discipline, because even reasonable people clicking around inevitably produce drift. The freeze is what keeps the phase 1 discipline from eroding.
- New monitors must come through code: The team's standard practice for creating a new monitor is to write the code, commit it, and let CI apply it. UI creation is no longer the path, and the team learns the new workflow.
- UI access is read-only for most engineers: Permissions are adjusted. Read access stays open; write access is restricted to a small set of authorized accounts (typically the CI service account). The restriction enforces the discipline.
- Emergency creation through the UI is allowed but rare: Genuine emergencies (a production fire that needs a new alert immediately) can use the UI. The exception is bounded: each emergency creation is followed by a code commit that captures the same monitor.
- Drift detection: A scheduled job compares the platform's current state to the code. Drift is flagged and the team reconciles it. Drift left unreconciled for more than 7 days triggers escalation, so the discipline is enforced by automation rather than goodwill.
- Documentation: The new workflow is documented. New engineers learn it from the start, and it becomes the obvious path.
The freeze is the operational discipline that makes phase 1's export sustainable. Without it, the team gradually drifts back to UI-edited reality.
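The scheduled drift check described above can be sketched as a comparison of two mappings. This is an illustration under stated assumptions: both the code-defined state and the platform state are assumed to have been loaded already as dictionaries keyed by monitor name, and the Drift type, the kind labels, and the function names are hypothetical. The 7-day threshold is the one from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Drift older than this is escalated (threshold taken from the text above).
ESCALATE_AFTER = timedelta(days=7)


@dataclass(frozen=True)
class Drift:
    name: str
    kind: str  # "only_in_ui", "only_in_code", or "modified"


def detect_drift(code: dict[str, dict], platform: dict[str, dict]) -> list[Drift]:
    """Compare code-defined monitors with what the platform currently has."""
    drifts: list[Drift] = []
    for name in sorted(code.keys() | platform.keys()):
        if name not in code:
            # Created in the UI and never captured in code.
            drifts.append(Drift(name, "only_in_ui"))
        elif name not in platform:
            # Defined in code but deleted, or never applied, on the platform.
            drifts.append(Drift(name, "only_in_code"))
        elif code[name] != platform[name]:
            # Exists in both places but the definitions differ.
            drifts.append(Drift(name, "modified"))
    return drifts


def needs_escalation(first_seen: datetime, now: datetime) -> bool:
    """True once a drift has stayed unreconciled for more than 7 days."""
    return now - first_seen > ESCALATE_AFTER
```

The job would persist the time each drift was first seen between runs; that storage is left out of the sketch. An "only_in_ui" result is also how an emergency monitor created during an incident surfaces for its follow-up commit.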
Phase 3: compound
The benefits compound after the freeze stabilizes. Common patterns emerge; templates capture them; the platform team improves the patterns; everyone benefits. The discipline produces increasing returns over time.
- Common patterns become reusable modules: The "standard SLO alert" pattern is the same across many services. Once it is captured as a Terraform module or template, every new service uses the same pattern. Consistency improves and the per-service work drops.
- Templates spread across teams: A team that develops a useful monitoring template shares it with other teams. The shared module works for many services, and improvements to it benefit all consumers.
- Quality compounds: Each iteration of the templates improves on the last. Bug fixes propagate and new features are added, so monitoring quality rises without per-service work.
- A platform team improves the templates: A dedicated platform or SRE team can focus on improving the templates. Their work has high leverage because an improvement affects every service that uses the template. Without templates, the same improvement would require touching every service individually.
- Everyone benefits: Service teams adopt the templates, the platform team improves them, and service teams benefit again. The cycle compounds, and monitoring quality rises across the organization.
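A reusable "standard SLO alert" can be as small as one function that every service calls with its own parameters. The sketch below is an assumption-laden illustration, not a real provider schema: the returned field names, the default values, and the tag format are all hypothetical, and a Terraform module would play the same role in a Terraform-based setup.

```python
def standard_slo_alert(service: str, slo_target: float = 0.999, window: str = "5m") -> dict:
    """Build the shared SLO alert definition for one service.

    The output schema is hypothetical; map it to the platform's real format.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The alert fires when the error rate exceeds the error budget (1 - SLO).
    # Rounding hides floating-point noise such as 0.0010000000000000009.
    error_budget = round(1 - slo_target, 6)
    return {
        "name": f"{service}: error rate above SLO budget",
        "service": service,
        "metric": "error_rate",
        "window": window,
        "threshold": error_budget,
        "tags": [f"service:{service}", "template:standard-slo-alert"],
    }


# Each service gets the same pattern with its own target.
monitors = [
    standard_slo_alert("checkout"),
    standard_slo_alert("search", slo_target=0.99, window="10m"),
]
```

Because every service goes through the same function, a fix to the threshold logic or a new default tag reaches all of them in one reviewed commit, which is the leverage the list above describes.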
Monitoring-as-code migration is a multi-month project whose benefits last for years. Nova AI Ops integrates with monitoring platforms across providers, surfaces UI-edited monitors that drift from code, and produces the reconciliation queue that keeps the discipline alive after the migration.