Observability as Code: Treating Dashboards Like Software
Click-built dashboards are accurate the day they're built and slowly diverge from reality. Observability-as-code keeps them honest: every panel, alert, and SLO lives in git, gets reviewed, and ships with the service it watches.
Click-built vs as-code
Click-built dashboards live in the vendor's UI. They're easy to start with, hard to review, and impossible to diff. Observability-as-code dashboards live in git as YAML/JSON/Jsonnet. They get reviewed, versioned, and deployed with the service.
The discovery. Most teams start click-built because it's faster. Six months in, they have 50 dashboards; nobody knows which are accurate; there's no audit trail and no rollback. The migration to as-code happens because the click-built debt becomes unbearable.
The shift's nature. As-code isn't a tool change; it's a culture change. Engineers must edit YAML/JSON instead of clicking. PR reviews become part of the dashboard workflow. The shift takes 3-6 months for most teams.
Benefits
Reviews catch threshold mistakes before they ship. Deletion of a dashboard becomes a PR, not an accident. Onboarding a new service includes its observability, automatically.
The review benefit. A dashboard PR is reviewable like code. Bad threshold? Reviewer flags. Misnamed metric? Reviewer flags. Missing context? Reviewer asks. Click-built dashboards have no review; the engineer's first thought becomes the dashboard.
The versioning benefit. Git history shows when a dashboard changed and why. "Who tightened the SLO threshold last quarter?" is answerable from git log. Without versioning, the answer is "nobody knows."
The deployment automation. New service ships? Its dashboard ships with it. The dashboard config is in the same repo as the service; the deploy pipeline applies both. Observability isn't a separate manual step.
Three tooling options
- Grafonnet / Jsonnet: programmatic, composable, used by big platform teams.
- Terraform provider: works for Datadog, Grafana, New Relic. Familiar idiom.
- Vendor-native: Datadog dashboard JSON, Honeycomb boards-as-yaml. Simplest to start.
Grafonnet's strength. Programmatic generation. A function generates 20 service dashboards by iterating over service definitions. Reduces duplication; updates propagate automatically. The cost is learning Jsonnet; payoff is at scale.
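Grafonnet itself is written in Jsonnet, but the iteration idea translates directly. A minimal Python sketch of the same pattern, assuming a hypothetical service list and a panel shape loosely modeled on Grafana's JSON (the metric names and fields here are illustrative, not any vendor's schema):

```python
# Sketch: generate one dashboard definition per service by iterating
# over service definitions, instead of hand-editing each dashboard.
# Service names, SLO values, and panel fields are hypothetical.
SERVICES = [
    {"name": "checkout", "slo_latency_ms": 300},
    {"name": "payments", "slo_latency_ms": 500},
]

def dashboard_for(service):
    """Build a minimal dashboard definition for one service."""
    name = service["name"]
    return {
        "title": f"{name} service",
        "panels": [
            {"title": "Request rate",
             "expr": f'sum(rate(http_requests_total{{service="{name}"}}[5m]))'},
            {"title": f"p99 latency (SLO: {service['slo_latency_ms']}ms)",
             "expr": f'histogram_quantile(0.99, '
                     f'rate(http_request_duration_seconds_bucket{{service="{name}"}}[5m]))'},
        ],
    }

# One function call per service: add a service definition, get a dashboard.
dashboards = [dashboard_for(s) for s in SERVICES]
```

Adding a third service is a one-line change to the service list; updating the latency panel updates it for every service at once. That is the duplication-reduction Grafonnet buys, at the cost of learning its language.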
Terraform's strength. Familiarity. Teams already using Terraform for infrastructure use the same tool for observability. Single tool for both reduces operational complexity. The cost is Terraform's verbosity for dashboard configuration.
Vendor-native's strength. Lowest barrier. Export the existing click-built dashboard as JSON; commit it to git; iterate from there. The simplest migration path; no new tools.
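One wrinkle in the export-commit-iterate loop: vendor exports usually include volatile metadata (internal IDs, version counters) that churns on every export and drowns out real changes in the diff. A small normalization step keeps git diffs meaningful. A sketch, assuming field names typical of Grafana-style exports; your vendor's may differ:

```python
import json

# Sketch: strip volatile metadata from an exported dashboard before
# committing, so diffs show real changes rather than counter churn.
# The field names below are assumptions, not a fixed vendor schema.
VOLATILE_FIELDS = {"id", "uid", "version", "iteration"}

def normalize(exported: dict) -> str:
    cleaned = {k: v for k, v in exported.items() if k not in VOLATILE_FIELDS}
    # Sorted keys keep the serialization stable across exports.
    return json.dumps(cleaned, indent=2, sort_keys=True)

raw = {"id": 42, "version": 7, "title": "checkout", "panels": []}
print(normalize(raw))
```

Run the export through this before every commit and two exports of an unchanged dashboard produce an empty diff.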
Migration path
Don't migrate everything at once. Pick one new service, do its dashboards as code from day one. Six months later most new services follow the pattern naturally; legacy dashboards stay in the UI until they're consequential enough to migrate.
The 80/20 of dashboards. Most teams have 20% of dashboards that are constantly used (production services) and 80% that are rarely opened (legacy, special projects, exploratory). Migrate the 20%; the 80% can stay click-built indefinitely.
The migration tooling. Most vendors export dashboards as JSON. Export, commit, iterate. The first migration is awkward; the second is smoother; by the fifth, the team has a workflow.
The failure mode
Generators that produce 200-line YAML for every dashboard. Engineers stop reading the diffs; the review becomes a rubber stamp. Keep generated dashboards short; under 100 lines per service is a good upper bound.
The verbosity trap. Vendor JSON is naturally verbose; a generated dashboard can run 500-1,000 lines. Reviewers can't catch issues; they approve without reading. The review becomes ceremonial.
The countermeasure. Generate the dashboard from a higher-level definition (e.g., "service has SLO X") that's 10 lines. The generator produces the verbose YAML; the diff that gets reviewed is the 10-line definition, not the 500-line output. Reviews stay meaningful.
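A sketch of that countermeasure, assuming a hypothetical spec shape and panel vocabulary: the reviewed artifact is the small spec at the top; the generator expands it into the verbose output that actually ships.

```python
# Sketch: reviewers read this ~5-line spec, not the generated output.
# Spec shape, panel names, and metric expressions are hypothetical.
spec = {
    "service": "checkout",
    "slo_latency_p99_ms": 300,
    "panels": ["request_rate", "error_rate", "latency_p99"],
}

def panel(kind, service, slo_ms):
    """Expand a one-word panel name into a verbose panel definition."""
    exprs = {
        "request_rate": f'sum(rate(http_requests_total{{service="{service}"}}[5m]))',
        "error_rate": f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))',
        "latency_p99": f'histogram_quantile(0.99, '
                       f'rate(http_request_duration_seconds_bucket{{service="{service}"}}[5m]))',
    }
    return {"title": kind.replace("_", " "), "expr": exprs[kind],
            "threshold_ms": slo_ms if kind == "latency_p99" else None}

def expand(spec):
    """Generate the full (verbose) dashboard from the short spec."""
    return {
        "title": f'{spec["service"]} overview',
        "panels": [panel(p, spec["service"], spec["slo_latency_p99_ms"])
                   for p in spec["panels"]],
    }

dashboard = expand(spec)
```

The generator's output is committed or applied by CI but never hand-reviewed; a threshold change shows up in the diff as one changed number in the spec.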
Maintaining the discipline
Once observability-as-code is established, the discipline that keeps it real is "no new dashboards via UI." The exception that swallows the rule: an engineer wants to iterate quickly, clicks together a dashboard, promises to migrate it later, and never does. Now you have a UI dashboard nobody else knows about.
The enforcement. Vendor permissions can prevent UI editing in production, forcing engineers through the as-code path. Heavy-handed but effective. The alternative is social: the team's convention is to reject any dashboard change that bypasses the PR flow.
Common antipatterns
The "we'll migrate everything" rebuild. Team commits to migrating all 100 dashboards in a quarter. The work consumes the team; new feature development stalls. Migrate gradually; legacy stays click-built.
Generators that hide complexity. Engineers can't read the generated output; can't debug when something looks wrong. Generators are great when their inputs are concise; they fail when the output becomes the only artifact.
Dashboards in a separate repo from the service. Service code in repo A; dashboards in repo B. Two PRs needed for any change; coordination overhead. Co-locate; one repo, one PR.
The "dashboards-as-code" label without the discipline. Team writes JSON to a repo; nobody reviews; nobody syncs to production. The label is on; the practice isn't. Both are required.
What to do this week
Three moves. (1) Pick a tooling option. Vendor-native is the lowest barrier; pick it unless you have specific reason for Grafonnet/Terraform. (2) For your next new service, ship its dashboards as code from day one. The new-service path is where the practice gets established. (3) Document the migration path for legacy dashboards. Most won't migrate; the few important ones should be on a quarterly roadmap.