EKS Cluster Upgrade Strategy

K8s upgrades come quarterly. The strategy that keeps clusters current without breaking workloads.

Cadence

Kubernetes upgrade strategy is the discipline of staying current with the platform without disrupting workloads. Falling too far behind produces support-version risk, security exposure, and eventually forced upgrades; staying too aggressive produces churn. The right cadence keeps the cluster within the supported version window and minimizes upgrade-related disruption.

What good cadence looks like:

Stay within 2 versions of latest.: Kubernetes supports the three most recent minor versions. Staying within 2 versions of latest provides margin: when the next release drops, the team is not immediately on an unsupported version. The discipline keeps the cluster in the support window continuously.
Older versions lose support.: Once a version falls out of support, security patches stop, the cloud provider may end management for that version, and the cluster's risk profile worsens. The cost of upgrading from a deeply outdated version is much higher than incremental upgrades.
Plan an upgrade quarterly.: The team plans a Kubernetes upgrade every quarter. The plan covers timing, risk assessment, addon compatibility, application impact. Planning is independent of execution; not every plan results in immediate action.
Execute one every 2 quarters.: Actual upgrades happen roughly every 6 months. The cadence matches the Kubernetes release cycle (about 4 months per minor version). Two quarters lets the team skip every other minor version while staying within the support window.
Match the cloud provider's pace.: EKS, GKE, and AKS deprecate older versions on their own schedules. The team's cadence respects the provider's schedule; falling behind the provider's deprecation window forces emergency upgrades.

Cadence is the foundation. Without a cadence, upgrades become emergencies; with a cadence, they become routine.

Test

The riskiest part of an upgrade is not the version increment itself; it is the interaction of the new version with the team's specific workloads, addons, and configurations. Testing in non-production before production is the discipline that catches these interactions.

Upgrade a non-prod cluster first.: The dev or staging cluster is upgraded first. The same workloads, the same addons, the same network policies as production are exercised against the new version. Issues surface before they reach production.
Run real workloads for a week.: A short test does not catch slow-emerging issues. A week of real traffic against the upgraded cluster surfaces interactions that one-off tests miss. Time is the validation that synthetic tests cannot replace.
Validate addons.: Cluster addons (cert-manager, ingress-nginx, monitoring agents, network plugins) are validated against the new version. Compatibility documentation lists supported versions; the team confirms compatibility before upgrading.
Validate network policies.: Network policies and CNI configurations sometimes behave differently across versions. Specific policies that worked under v1.27 might need adjustment under v1.30. The validation catches these.
Validate custom resources.: Custom Resource Definitions (CRDs) and operators sometimes lag Kubernetes versions. The team verifies that all CRDs and operators support the target version before upgrading.

Testing is what converts upgrade risk into known behavior. The investment is small; the protection is significant.

Stage production

Production upgrades are staged: dev first, then staging, then production. Each stage provides feedback that informs the next. Staging through environments distributes risk and gives the team time to react to issues.

Upgrade dev to staging to prod.: The pipeline is sequential. Dev upgrades first; observed for issues; staging follows; observed for issues; prod follows. The sequence may take days to weeks; the safety improvement is meaningful.
Each stage gives feedback.: What broke in dev that needs fixing before staging? What broke in staging that needs fixing before prod? Each stage is a learning opportunity; the production upgrade applies all the learnings.
Surge upgrade strategy: launch new nodes, drain old, terminate.: For node group upgrades, the surge strategy launches new nodes with the new version, drains workloads from old nodes, terminates old nodes. The cluster always has capacity; workloads migrate gradually.
Pause between stages.: A pause between stages lets the team observe production-like traffic on the upgraded environment. The pause length depends on the change risk; safer changes can move quickly, riskier changes warrant longer pauses.
Document the procedure.: Each upgrade procedure is documented. The runbook captures the steps, the validation checks, the rollback procedure. Future upgrades reference the runbook; the discipline compounds across upgrades.

Cluster upgrade strategy is the discipline that makes Kubernetes operations sustainable over years. Nova AI Ops integrates with cluster telemetry across upgrade waves, surfaces compatibility issues with workloads and addons, and produces the upgrade-readiness report that the platform team uses to drive each upgrade.