Zero-Downtime Kubernetes Cluster Upgrades: A Playbook
Kubernetes ships a new minor version roughly every four months and drops support for old ones after about a year. Skipping upgrades is not an option. Here is the practical playbook for upgrading a production cluster without dropping a request.
Why Upgrades Are Riskier Than They Should Be
Kubernetes upgrades have a worse reputation than they deserve. The control plane upgrade is genuinely safe most of the time. The node upgrade is mechanically simple. The risk concentrates in three places that are mostly preventable: deprecated APIs in your manifests, workloads without proper Pod Disruption Budgets, and missing or stale storage class definitions.
The default upgrade path on managed services (EKS, GKE, AKS) handles 80% of the work for you. The remaining 20% (workload readiness, careful node draining, and handling stateful services) is where teams cause their own outages. This playbook focuses on that 20%.
Pre-Upgrade: The 8-Item Checklist
Run through this checklist before touching the cluster. Each item takes 5-30 minutes and prevents specific failure modes:
- Check deprecated APIs. Run pluto detect-helm or kubectl deprecations against the target version, then remove or migrate anything flagged (a quick sketch follows this checklist). Every recent minor release has removed long-deprecated APIs, and manifests that still reference them break after the upgrade.
- Verify PodDisruptionBudgets. Every multi-replica Deployment should have a PDB. Without one, nothing stops a drain from evicting every replica at once.
- Audit StorageClass definitions. Some upgrades change the default StorageClass behavior. Pin your StorageClasses explicitly rather than relying on defaults.
- Check controller dependencies. ingress-nginx, cert-manager, external-dns, and other controllers have minimum and maximum supported Kubernetes versions. Verify compatibility.
- Validate node-pool surge capacity. The upgrade needs spare capacity to bring up new nodes before draining old ones. Confirm autoscaler can provision new nodes.
- Run a dry-run on staging. Upgrade your staging cluster first. Wait 48-72 hours to surface latent issues before touching prod.
- Notify dependent teams. Send a calendar block to engineering teams for the day of the upgrade; it surfaces "wait, we have a release planned" conflicts early.
- Snapshot critical state. Back up etcd (managed services do this automatically). Snapshot any stateful workloads.
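A minimal sketch of the first two checks, assuming pluto and jq are installed and your manifests live in a local checkout (the directory path is a placeholder):

```bash
# Scan Helm releases and static manifests for APIs removed in the target version.
pluto detect-helm --target-versions k8s=v1.33.0 -o wide
pluto detect-files -d ./manifests --target-versions k8s=v1.33.0 -o wide

# List multi-replica Deployments, then list PDBs; anything in the first list
# without a matching PDB in the second needs one before the upgrade.
kubectl get deploy -A -o json \
  | jq -r '.items[] | select(.spec.replicas > 1) | "\(.metadata.namespace)/\(.metadata.name)"'
kubectl get pdb -A
```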
Step 1: Upgrade the Control Plane
The control plane upgrade is the safest part of the process. Kubernetes' version skew policy lets kubelets run a few minor versions behind the API server (but never ahead of it), so workers on the previous version keep functioning while the control plane updates.
On managed services:
- EKS: aws eks update-cluster-version --name <cluster> --kubernetes-version 1.33. Takes 30-60 minutes for the API server upgrade.
- GKE: use the gcloud CLI or console. GKE upgrades the control plane in a rolling fashion across the regional control plane replicas.
- AKS: az aks upgrade --control-plane-only --kubernetes-version <version>. Upgrades the control plane without touching nodes.
While the control plane upgrades, monitor the API server availability. Brief connection drops are normal during the rolling control plane replica replacement; sustained errors are not. If the upgrade fails partway through, the managed service typically rolls back automatically.
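One simple way to watch for sustained API server errors during the control plane roll (a sketch, not a substitute for real monitoring):

```bash
# Poll the API server readiness endpoint every 5 seconds and log any failure with a timestamp.
while true; do
  if ! kubectl get --raw /readyz >/dev/null 2>&1; then
    echo "$(date -u +%H:%M:%SZ) API server not ready"
  fi
  sleep 5
done
```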
Step 2: Upgrade Node Pools (Surge Strategy)
Node upgrades are where production impact happens. The standard pattern is "surge upgrade":
- Provision a new node with the target Kubernetes version.
- Cordon (mark unschedulable) one old node.
- Drain the old node (evict all pods, respecting PDBs).
- Wait for evicted pods to reschedule on healthy nodes.
- Delete the old node.
- Repeat until all nodes are upgraded.
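Managed node-pool upgrades run this loop for you, but the same steps can be done by hand; a sketch for a single node (the node name is a placeholder):

```bash
NODE=ip-10-0-1-23.ec2.internal   # placeholder node name

# Stop new pods from landing on the node, then evict existing pods while respecting PDBs.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m

# Once the drain completes and replacement pods are Ready elsewhere, remove the node.
kubectl delete node "$NODE"
```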
The "surge" parameter controls how many new nodes can be provisioned at once. maxSurge: 1 upgrades one node at a time (slow but safe). maxSurge: 25% upgrades 25% of the pool in parallel (faster but higher risk). For production, maxSurge: 1, maxUnavailable: 0 is the safest default.
Common gotcha: The drain hangs when a PDB cannot be satisfied, the classic case being a single-replica workload covered by minAvailable: 1. Solution: ensure every multi-replica deployment has minAvailable: 1 or maxUnavailable: 1 in its PDB, give anything covered by a PDB at least two replicas, and use --disable-eviction only as a last resort, since it bypasses PDBs entirely.
Stateful workloads need extra care. StatefulSet pods are evicted one at a time and need their PVCs to detach and reattach to the new node. Some storage providers (especially older ones) take 60-120 seconds for the volume detach/attach cycle. Build that into your upgrade time estimate.
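During a stateful drain it can help to watch the detach/attach cycle directly; VolumeAttachment objects show which volumes are still bound to the old node. A small sketch (the app label is a placeholder):

```bash
# Watch volumes detach from the draining node and reattach to its replacement.
kubectl get volumeattachments -o wide --watch

# Confirm StatefulSet pods come back Ready one at a time.
kubectl get pods -l app=my-stateful-app --watch
```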
Step 3: Pod Disruption Budgets That Actually Work
PDBs are the single most important upgrade safety primitive, and they are the most commonly misconfigured. The two PDB modes:
```yaml
# minAvailable: at least N replicas always running
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-svc-pdb
spec:
  minAvailable: 2          # always keep 2 running
  selector:
    matchLabels:
      app: payment-svc
```

```yaml
# maxUnavailable: at most N replicas can be down
spec:
  maxUnavailable: 1        # never drain more than 1 at a time
```
For most stateless services with 3+ replicas, maxUnavailable: 1 is the right default. For stateful services or quorum-based systems (etcd, Cassandra, Kafka), minAvailable with an absolute count is safer.
Common gotcha 1: A PDB that requires as many (or more) replicas as the deployment runs. Symptom: drain hangs forever. Fix: size the PDB so at least one pod can always be evicted; minAvailable: 3 on a 3-replica deployment will block drains entirely.
Common gotcha 2: A PDB on a single-replica workload. Symptom: drain cannot proceed because evicting the only replica would violate the PDB. Fix: either increase replicas to 2+, or accept that single-replica workloads will have a brief outage during their node's upgrade and relax the PDB so the drain can proceed.
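Both gotchas show up in the ALLOWED DISRUPTIONS column before you ever start a drain; a quick pre-flight check, with an optional jq filter for the blockers:

```bash
# Any PDB with ALLOWED DISRUPTIONS = 0 will block draining the nodes its pods run on.
kubectl get pdb -A

# List only the blockers (assumes jq is available).
kubectl get pdb -A -o json \
  | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
```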
When to Use Blue-Green Cluster Upgrades
Sometimes the in-place upgrade is too risky. The alternative is to provision an entirely new cluster on the target version and migrate workloads to it.
When blue-green makes sense:
- Skipping multiple Kubernetes versions (e.g., 1.28 → 1.33). In-place upgrades must step through each minor version, and the accumulated CRD and API churn is too large to absorb safely.
- Major infrastructure changes (changing CNI, switching from cgroups v1 to v2, changing OS).
- Compliance or audit requirements that need a clean slate.
- Risk-averse teams that want a tested rollback path more than they want operational simplicity.
How to do it: Provision the new cluster with the target version. Replicate the workload manifests (GitOps tools like ArgoCD make this trivial). Cut over traffic via DNS or load balancer once the new cluster is healthy. Keep the old cluster running for 2-7 days as a rollback path, then decommission.
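A minimal sketch of the cutover prep, assuming a kubeconfig context named green for the new cluster and a kustomize-style manifest layout (both are placeholders; many teams let Argo CD or Flux do this step instead):

```bash
# Apply the same GitOps-managed manifests to the new cluster and check that everything schedules.
kubectl --context green apply -k ./clusters/production
kubectl --context green get pods -A --field-selector=status.phase!=Running

# Only once that looks clean, start shifting traffic (DNS weights or load-balancer targets).
```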
Trade-off: Blue-green doubles your cloud spend during the migration window and requires explicit traffic management. The simplicity of "just spin up a new cluster" hides the complexity of "migrate persistent state, rewire ingress, validate every integration."
Rollback: When and How
Most managed Kubernetes services do not support direct version downgrades. Once you upgrade the control plane, you cannot roll it back. This makes the pre-upgrade dry-run on staging non-negotiable.
What you can roll back: Node upgrades. If a new node version causes problems, you can revert that node pool to the previous version while the control plane stays at the new version (kubelets may run several minor versions behind the API server, so older workers under a newer control plane are supported).
What you cannot roll back: The control plane. If you upgrade EKS to 1.33 and discover a regression in the control plane, your only paths are (a) wait for AWS to ship a patch, (b) blue-green to a fresh cluster on 1.32, or (c) live with the regression and roll forward to a fix.
The decision criteria: If you uncover an issue during the upgrade, the right call is usually to stop the upgrade rather than try to roll back. Halt at the current state, restore the affected workloads to a safe configuration, investigate, and resume the upgrade after the fix.
Post-Upgrade: The 5-Item Verification Checklist
After every upgrade, validate these five items before declaring success:
- kubectl version matches the target. Confirm both client and server are on the new version.
- All system pods are running. kubectl get pods -n kube-system should show no failures or restarts in the last hour.
- All workload pods are healthy. Spot-check critical workloads via dashboards or kubectl. Watch for surprise CrashLoopBackOff caused by version-specific changes.
- Cluster autoscaler still works. Trigger a brief scale event (deploy a temporary high-replica workload; see the sketch after this checklist) to verify node provisioning works on the new version.
- Monitoring and logging still flow. Confirm that metrics, logs, and traces from your workloads are still arriving in your observability stack. Some upgrades change the kubelet API and break older agents.
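A throwaway way to exercise the autoscaler check above (image, replica count, and CPU request are arbitrary; delete the deployment when done):

```bash
# Create pause pods with CPU requests large enough to exceed spare capacity,
# watch new nodes get provisioned, then clean up.
kubectl create deployment scale-test --image=registry.k8s.io/pause:3.9 --replicas=20
kubectl set resources deployment scale-test --requests=cpu=500m
kubectl get nodes --watch
kubectl delete deployment scale-test
```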
For teams that want to automate this entire upgrade process (pre-upgrade checks, safe drain orchestration, post-upgrade verification, and cross-cluster monitoring), AI-native platforms like Nova AI Ops include continuous deprecation scanning, automated PDB analysis, and upgrade-readiness scoring. The platform also detects subtle post-upgrade regressions (like new pod restart patterns or latency drift) that manual verification often misses. Try Nova to see your cluster's upgrade readiness score.