Cluster Version Tracking
Track every cluster's version centrally.
Inventory
Cluster version tracking is the discipline of knowing which versions of Kubernetes, addons, and components run across the team's clusters. Without tracking, drift accumulates; some clusters get behind; security and capability gaps appear. With tracking, the team can drive consistent upgrades and detect outliers.
What good inventory looks like:
- Per cluster: control plane version, node version, addon versions.: Each cluster's full version stack is captured. Control plane (api-server, scheduler, controller-manager), node OS and kubelet, CNI version, ingress version, monitoring agent versions all are tracked.
- Single source of truth.: The inventory lives in one place. The platform team's tooling queries this source; conversations reference it; decisions use it.
- Auto-collected.: The inventory is collected automatically. Each cluster reports its versions; the inventory updates in near-real-time; manual entry is unnecessary.
- Historical tracking.: Beyond current state, the inventory tracks history. The team can see when clusters upgraded; the timeline supports postmortem and audit conversations.
- Per-component breakdowns.: The inventory breaks down by component. A specific question (which clusters run cert-manager 1.13?) is answerable from the inventory.
Inventory is the foundation. Without it, version tracking is anecdotal; with it, it is data-driven.
Dashboard
The inventory feeds dashboards. Outliers (clusters significantly behind), stale clusters (no upgrades in months), and version-fragmentation patterns all surface visually.
- Outliers visible.: Clusters running significantly older versions than the rest are visible. The platform team can target them for upgrade attention.
- Stale clusters surface.: Clusters that have not been upgraded in months surface. The reasons differ (busy team, complex workload, regulatory hold); the dashboard makes the staleness visible.
- Drives upgrade prioritisation.: The dashboard's outliers and stale clusters are the upgrade backlog. The platform team's upgrade work is prioritized by data, not by squeaky wheel.
- Per-team views.: Different teams own different clusters. The dashboard supports per-team views; each team sees their own clusters' status; accountability is clear.
- Trend over time.: The dashboard shows trend. Version skew across the fleet should bound; if it grows, the discipline is degrading. The trend is the program-level metric.
The dashboard turns the inventory into action. Without it, the data exists but does not drive change.
Policy
The team's policy specifies the allowed version skew. Within the policy, clusters operate normally; outside it, the cluster needs attention. The policy bounds the fleet's heterogeneity.
- Maximum 2 versions skew across fleet.: The fleet's clusters should be within 2 minor versions of each other. Some clusters might be on N; others on N-1 or N-2. Beyond this, the support story degrades.
- Otherwise fragmentation.: When clusters span many versions, operational support becomes harder. Bug fixes that work on one version do not work on another; documentation is per-version; the team's effort fragments.
- Drives the upgrade cadence.: The policy implies a cadence. As Kubernetes releases new minor versions, the fleet must keep pace; clusters falling outside the skew get prioritized for upgrade.
- Documented and reviewed.: The policy is documented and reviewed periodically. Some teams need tighter skew (regulatory); some can accept wider (early-stage). The review keeps the policy current.
- Auditor-ready.: The policy and the inventory together produce auditor-ready evidence. The team's claim of fleet consistency is documented and verifiable.
Cluster version tracking is one of those platform disciplines that pays off across many clusters and many years. Nova AI Ops integrates with cluster inventory tools, surfaces version drift, and produces the per-cluster upgrade prioritization queue that the platform team uses.