Kubernetes Advanced
By Samson Tanimawo, PhD · Published Dec 13, 2026 · 12 min read

Kubernetes StatefulSet Operations: Backups, Upgrades, and Reschedule Risk

Most StatefulSet incidents are scheduling incidents in disguise. The patterns are well-documented; the discipline is rare.

Why StatefulSets are different

Deployments treat pods as interchangeable. StatefulSets do not: each pod has a stable identity, a stable DNS name, and persistent storage. That stability is what lets you run Postgres or Kafka in the cluster, but it also means scheduling decisions matter far more.

Most StatefulSet outages share a root cause: the cluster decided to reschedule a pod, and the replacement pod could not get back to its persistent volume in time. Either the PVC was backed by a node-local disk, the new pod was scheduled in a different availability zone than its volume, or the storage class did not support detaching and reattaching across nodes.
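One common mitigation for the AZ-mismatch case is to delay volume binding until the pod is scheduled, so the volume is provisioned in whatever zone the scheduler picks. A minimal sketch, assuming the AWS EBS CSI driver (the class name is illustrative):

```yaml
# Hypothetical StorageClass. volumeBindingMode: WaitForFirstConsumer
# defers PV provisioning until a consuming pod is scheduled, keeping
# the volume and the pod in the same availability zone.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-wait            # illustrative name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Note that this only helps at initial provisioning; an already-bound zonal volume still pins its pod to that zone on reschedule.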

Backups that survive a delete
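If the storage class has a CSI snapshot driver installed, a VolumeSnapshot gives you a point-in-time copy that outlives a pod or PVC delete. A sketch, assuming the external-snapshotter CRDs are present (names are illustrative):

```yaml
# Hypothetical VolumeSnapshot of the first replica's data PVC.
# A snapshot is only a backup once you have rehearsed restoring it.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap        # illustrative name
spec:
  volumeSnapshotClassName: csi-snapclass   # assumption: this class exists
  source:
    persistentVolumeClaimName: data-postgres-0
```

A restore is a new PVC with `spec.dataSource` pointing at the snapshot; the delete-and-restore path is the part worth rehearsing.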

Upgrades without surprise reschedules

An updateStrategy of RollingUpdate with a sensible partition value lets you upgrade one replica at a time and gate progression on health. Without partitioning, a bad image rolls out to every replica and your database goes down.
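The partitioned rollout above looks like this in the spec. A minimal sketch, with the StatefulSet name and replica count assumed for illustration:

```yaml
# Hypothetical 3-replica StatefulSet. With partition: 2, only pods with
# ordinal >= 2 (here, just postgres-2) receive the new revision; lower
# the partition step by step as each replica proves healthy, down to 0.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres            # illustrative name
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2
  # serviceName, selector, template, volumeClaimTemplates omitted
```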

Pod disruption budgets prevent voluntary disruptions from taking out too many replicas at once. Set minAvailable to N-1 for an N-replica database; the cluster will refuse to drain a node if doing so would breach the budget.
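A minimal PDB for the N-1 rule, assuming a 3-replica StatefulSet whose pods carry an `app: postgres` label (both the name and label are illustrative):

```yaml
# Hypothetical PodDisruptionBudget: with minAvailable: 2 on a 3-replica
# set, a node drain that would drop the set below 2 ready pods is blocked.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb        # illustrative name
spec:
  minAvailable: 2           # N-1 for replicas: 3
  selector:
    matchLabels:
      app: postgres         # assumption: pods carry this label
```

Note that a PDB only gates voluntary disruptions (drains, evictions); it does nothing for node failures.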

PVC handling on rollout
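Recent Kubernetes releases let you declare what happens to the volumeClaimTemplate PVCs when the StatefulSet is deleted or scaled down, via `persistentVolumeClaimRetentionPolicy`. A sketch of the conservative setting:

```yaml
# Fragment of a StatefulSet spec. Retain/Retain keeps the PVCs (and data)
# around after a delete or scale-down, so an accidental kubectl delete of
# the StatefulSet is recoverable by recreating it against the same claims.
apiVersion: apps/v1
kind: StatefulSet
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain     # keep PVCs if the StatefulSet object is deleted
    whenScaled: Retain      # keep PVCs for replicas removed by scale-down
```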

Antipatterns

What to do this week

Three moves. (1) Identify every StatefulSet in your cluster; verify each has a tested backup pipeline. (2) Add PDBs to each with minAvailable equal to (N-1). (3) Schedule one StatefulSet restore rehearsal for next sprint.