The Dashboard Versioning Discipline
Keep dashboards in version control. The discipline ends 'who changed this' debates.
Dashboards in git
Export dashboard JSON to git. Grafana, Datadog, and others support API-based export. Commit on every change.
One file per dashboard. Naming convention: service/purpose.json. Engineers can grep for relevant dashboards by service or purpose.
Every change goes through a PR. The JSON diff is reviewable, so subtle modifications surface. Review catches accidental changes and bad defaults before they propagate.
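A minimal sketch of the export step, assuming the dashboard has already been fetched as a Python dict. The helper names (dashboard_path, write_dashboard) are hypothetical, and the list of volatile fields is an assumption based on the top-level id and version fields in Grafana's dashboard JSON; other platforms will need a different list.

```python
import json
from pathlib import Path

# Assumption: these top-level fields change on save without reflecting a real edit.
VOLATILE_FIELDS = ("id", "version")


def dashboard_path(root: Path, service: str, purpose: str) -> Path:
    """Build the service/purpose.json path from the naming convention."""
    return root / service / f"{purpose}.json"


def write_dashboard(root: Path, service: str, purpose: str, dashboard: dict) -> Path:
    """Write one dashboard to its own file in a stable, diff-friendly form."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_FIELDS}
    path = dashboard_path(root, service, purpose)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and fixed indentation keep PR diffs limited to real changes.
    path.write_text(json.dumps(cleaned, indent=2, sort_keys=True) + "\n")
    return path
```

Normalizing before commit matters more than the export call itself: without it, key reordering and version bumps drown out the change a reviewer needs to see.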
CI deployment
On merge, CI deploys the dashboard JSON to the platform via API. Same pipeline as application deploys.
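As an illustration of the deploy step, the sketch below builds the HTTP request a CI job might send. It assumes Grafana's POST /api/dashboards/db endpoint with a bearer token; the function name, base URL, and commit message format are hypothetical, and other platforms use different endpoints and payloads.

```python
import json
import urllib.request


def build_deploy_request(
    base_url: str, token: str, dashboard: dict, commit_sha: str
) -> urllib.request.Request:
    """Build (but do not send) the request that pushes one dashboard."""
    payload = {
        # A null id lets the platform match the dashboard by its uid.
        "dashboard": {**dashboard, "id": None},
        # Git is the source of truth, so the deployed copy is overwritten.
        "overwrite": True,
        "message": f"Deployed by CI from commit {commit_sha}",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The CI job would pass the result to urllib.request.urlopen and fail the pipeline on a non-2xx response. Keeping request construction separate from sending makes the step easy to unit test.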
Production dashboards are read-only in the UI for most users. Engineers cannot click-edit prod dashboards; they must PR.
Test in staging first. Dashboard regressions in production are rare but possible (broken queries, missing variables). Staging catches them.
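Some of these regressions can be caught before staging with a static check in CI. The sketch below looks for variables referenced in queries but never defined. It assumes a Grafana-style layout (templating.list for definitions, panels[].targets[].expr for queries) and the $name / ${name} reference syntax; the function name is hypothetical, and panels nested inside rows are not traversed.

```python
import re

# Matches $name and ${name} (optionally ${name:format}) references.
VAR_REF = re.compile(r"\$(?:\{(\w+)(?::\w+)?\}|(\w+))")


def undefined_variables(dashboard: dict) -> set[str]:
    """Return variable names used in panel queries but not defined."""
    defined = {v["name"] for v in dashboard.get("templating", {}).get("list", [])}
    referenced = set()
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            for braced, bare in VAR_REF.findall(target.get("expr", "")):
                referenced.add(braced or bare)
    # Assumption: names starting with "__" are platform built-ins, not user variables.
    return {name for name in referenced - defined if not name.startswith("__")}
```

A CI step can fail the PR when the returned set is non-empty. This does not replace staging, since a query can be syntactically complete and still return nothing.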
Rollback
Git revert restores the previous dashboard. CI redeploys; the change is gone in minutes.
Without versioning, rollback requires manual reconstruction from memory. Painful and error-prone.
Audit trail: git history shows every change, who made it, and why. Compliance-friendly.
Templating shared dashboards
Common patterns become template dashboards. Service health, deploy status, error rate. Per-service instances generated from templates.
Generators in CI: Jsonnet for Grafana, Terraform for Datadog, custom scripts otherwise. Source of truth is the template plus the service inventory.
Updating one template ripples to all services. A new panel added to the template appears on every service dashboard at the next deploy.
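For the custom-script case, a generator can be as small as a recursive placeholder substitution over the template. This is a sketch under stated assumptions: the __SERVICE__ placeholder, the service-health purpose name, and the function names are all hypothetical, and a real service inventory would likely carry more than a name.

```python
def _substitute(node, service: str):
    """Recursively replace the __SERVICE__ placeholder in every string value."""
    if isinstance(node, str):
        return node.replace("__SERVICE__", service)
    if isinstance(node, list):
        return [_substitute(item, service) for item in node]
    if isinstance(node, dict):
        return {key: _substitute(value, service) for key, value in node.items()}
    return node


def render_all(template: dict, services: list[str]) -> dict[str, dict]:
    """Generate one dashboard per service, keyed by its service/purpose.json path."""
    return {
        f"{service}/service-health.json": _substitute(template, service)
        for service in services
    }
```

Because every instance is rebuilt from the template on each run, a panel added to the template shows up in every generated file, and hand edits to generated files are overwritten by design.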
Operating discipline
Every dashboard has an owner team in metadata. Ownerless dashboards rot.
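The ownership rule can be enforced in CI rather than by convention. The sketch below assumes the owner is recorded as a tag of the form owner:team-name in the dashboard's tags list; that tag format and the function names are hypothetical, and any metadata field would work equally well.

```python
from typing import Optional

OWNER_PREFIX = "owner:"  # assumed tag convention, e.g. "owner:payments"


def owner_of(dashboard: dict) -> Optional[str]:
    """Return the owning team from the dashboard tags, or None if absent."""
    for tag in dashboard.get("tags", []):
        if tag.startswith(OWNER_PREFIX) and len(tag) > len(OWNER_PREFIX):
            return tag[len(OWNER_PREFIX):]
    return None


def ownerless(dashboards: dict[str, dict]) -> list[str]:
    """List the paths of dashboards that declare no owner."""
    return sorted(path for path, dash in dashboards.items() if owner_of(dash) is None)
```

Failing the PR when ownerless returns anything keeps new dashboards from entering the repository without a team attached.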
Quarterly cleanup: dashboards not viewed in 6 months are flagged, then either documented as still needed or retired.
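The flagging step reduces to a date comparison once last-viewed data is available. The sketch below assumes that data has already been collected into a mapping from dashboard path to last-viewed date; how to obtain it depends on the platform and is not shown. The function name and the 183-day threshold are illustrative choices.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=183)  # roughly six months


def stale_dashboards(last_viewed: dict[str, date], today: date) -> list[str]:
    """Return paths of dashboards last viewed before the staleness cutoff."""
    cutoff = today - STALE_AFTER
    return sorted(path for path, seen in last_viewed.items() if seen < cutoff)
```

Passing today as an argument rather than calling date.today() inside the function keeps the quarterly job reproducible and testable.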
Review user access regularly. Edit permissions on production dashboards are limited to platform engineers; everyone else has read-only access and proposes changes by PR.