By Samson Tanimawo, PhD · Published Sep 4, 2026

Kubernetes Storage and CSI Drivers Explained

PV, PVC, StorageClass, CSI: four nouns and a layered model. Then the block-storage gotchas that only show up at scale and break a Tuesday morning deploy.

Why storage in K8s is its own thing

Kubernetes was designed around stateless workloads. Compute scales, services move, pods get rescheduled, and the storage has to follow. The whole storage stack exists to give a pod the illusion that the disk it’s attached to is local, durable, and consistent, even when the pod just landed on a different node.

The hard problem. Block devices are inherently node-attached: an EBS volume, a GCE PD, an Azure managed disk each attach to one instance at a time. The pod might land on any node in the cluster; the volume needs to detach from the previous node and attach to the new one before the pod can mount. That dance is what the storage stack manages.

The cost when it goes wrong. Stuck volumes. Pods stuck in ContainerCreating for 10 minutes because the volume is still attached to a dead node. Database pods that won’t reschedule. The kind of incident that drains a whole afternoon and leaves an aftertaste of distrust.

The four layers

PersistentVolume (PV) is the cluster-level resource: an actual chunk of storage that exists somewhere, an EBS volume, an NFS export, a CephFS share. The PV is independent of any pod; it has a lifecycle of its own.

PersistentVolumeClaim (PVC) is the pod-level request. “I need 100GiB of fast storage with read-write access.” The PVC says what the pod wants; the cluster matches it to a PV.

StorageClass is the recipe for dynamic provisioning. Instead of pre-creating PVs, you define a StorageClass (“gp3-ssd-encrypted”), and when a PVC requests it, the cluster creates a PV on demand. The StorageClass references a CSI driver and a set of parameters.

CSI (Container Storage Interface) is the plug-in protocol. The driver implements the actions: provision, attach, mount, snapshot, expand (in CSI terms, CreateVolume, ControllerPublishVolume, NodePublishVolume, CreateSnapshot, ControllerExpandVolume). CSI replaced the older in-tree volume plugins; today every major storage vendor ships a CSI driver.

The flow. Pod requests a PVC; PVC references a StorageClass; StorageClass triggers a CSI driver; CSI driver provisions a PV; PV is bound to the PVC; pod mounts. The four layers are a chain; debugging a storage issue means walking the chain to find the broken link.
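To make the chain concrete, here’s a minimal sketch of the layers in YAML, assuming the AWS EBS CSI driver; the names (gp3-ssd-encrypted, data-claim, the postgres pod) are illustrative. Note that no PV appears anywhere: the driver creates it when the claim binds.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-ssd-encrypted
provisioner: ebs.csi.aws.com      # the CSI driver that provisions PVs on demand
parameters:
  type: gp3
  encrypted: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim                # "I need 100GiB of fast storage"
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3-ssd-encrypted
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
  - name: postgres
    image: postgres:16
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-claim       # the pod only ever references the claim

Apply all three and kubectl get pv shows the dynamically provisioned PV, bound to data-claim.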

What CSI actually does

The CSI driver is two pieces: the controller (one per cluster, usually a Deployment) and the node plugin (one per node, a DaemonSet). The controller talks to the cloud API to create, delete, attach, and detach volumes. The node plugin runs on each node to stage, mount, and unmount them.

The provision step. PVC arrives; the controller creates a cloud volume (EBS create-volume, GCE create-disk), creates the matching PV object, and binds it to the PVC. The cloud API call is the slow step; provisioning takes 30 seconds to a few minutes.

The attach step. Pod scheduled to a node; controller calls cloud API to attach the volume to that node; node plugin sees the new device, formats if needed. Attachment is the failure-prone step at scale.

The mount step. The node plugin mounts the device on the host and bind-mounts it into the pod’s filesystem. Fast, in-kernel, rarely the failing step.

The detach step. Pod is deleted; the node plugin unmounts; the controller calls the cloud API to detach. Detach is asynchronous; if the node is dead, the cloud API may need a force-detach to free the volume.
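The attach/detach bookkeeping is visible in the API as VolumeAttachment objects, which the controller creates on attach and deletes on detach. A sketch of what one looks like, with illustrative names:

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-4f2a19c8              # real names are generated hashes
spec:
  attacher: ebs.csi.aws.com       # which CSI driver owns the attachment
  nodeName: ip-10-0-1-23.ec2.internal
  source:
    persistentVolumeName: pvc-9c1d77e0
status:
  attached: true

kubectl get volumeattachments is the first command to run when a volume seems stuck: a VolumeAttachment still pointing at a dead node is the smoking gun.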

Block-storage gotchas at scale

Gotcha 1: per-node attachment limits. Most clouds cap the number of EBS-style volumes per instance, 28 on most EC2 types, 16 on smaller ones. Pack a node with pods that each carry a PVC and the first volume over the limit simply fails to attach. The error is “volume failed to attach” with no obvious cause; the fix is bigger nodes or packing fewer volume-backed pods per node.
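The limit the driver reports for each node is visible in the CSINode object; a sketch with illustrative values:

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-1-23.ec2.internal
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0abc123def456       # hypothetical instance ID
    allocatable:
      count: 25                   # attach slots left after ENIs and the root volume

The scheduler respects this count, but only if the driver reports it; compare it against the number of PVC-backed pods a node actually runs before you trust your bin-packing.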

Gotcha 2: zone affinity. Cloud block storage is zone-bound. An EBS volume in us-east-1a can’t attach to a node in us-east-1b. The pod has to schedule into the volume’s zone. Default schedulers handle this; custom topology constraints can break it.
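The standard defense is volumeBindingMode: WaitForFirstConsumer, which delays provisioning until the pod is scheduled, so the volume is created in whatever zone the pod landed in. A sketch:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # provision after scheduling, in the pod's zone
parameters:
  type: gp3

The alternative, Immediate binding, provisions the volume before the scheduler has picked a node; that’s exactly how a pod ends up permanently pinned away from its storage.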

Gotcha 3: stuck volumes after node failure. A node goes hard-down; its volumes are still “attached” from the cloud’s perspective; pods can’t reschedule because the volumes can’t move. The fix is force-detach, either via the cloud console or via the CSI driver’s force-detach feature; many drivers don’t do it automatically.
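On recent Kubernetes (the non-graceful node shutdown feature, GA since 1.28), you can tell the control plane a node is truly dead by tainting it; the attach-detach controller then force-detaches its volumes so pods can reschedule. A sketch, with a hypothetical node name:

apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal   # the confirmed-dead node
spec:
  taints:
  - key: node.kubernetes.io/out-of-service   # signals that force-detach is safe here
    value: nodeshutdown
    effect: NoExecute

Only apply this after confirming the node is actually down; taint a live node and you’re force-detaching volumes out from under running writes.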

Gotcha 4: volume expansion. Modern CSI drivers support online expansion: resizing the volume without restarting the pod. But the StorageClass has to set allowVolumeExpansion: true; the filesystem has to support online resize (ext4 and xfs do); and the underlying cloud volume has to be expanded too. Three-step coordination, often broken.
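A sketch of the happy path, assuming a driver that supports online expansion: the class opts in, and growing the claim’s request triggers the other two steps.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true        # step 1: the class must opt in
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3-expandable
  resources:
    requests:
      storage: 200Gi              # was 100Gi; raising this triggers steps 2 and 3:
                                  # the driver grows the cloud volume, then the
                                  # kubelet grows the filesystem online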

Gotcha 5: snapshot restore. CSI snapshots are great for backups; restoring is where it gets tricky. The restored volume has the data but no PVC binding; you need to create a PVC pointing at the snapshot. Most teams discover this in the middle of a recovery exercise.
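The restore is a new PVC whose dataSource points at the snapshot. A sketch using the snapshot.storage.k8s.io/v1 API, with hypothetical names:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-backup
spec:
  volumeSnapshotClassName: ebs-snapclass    # hypothetical snapshot class
  source:
    persistentVolumeClaimName: data-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-restored
spec:
  storageClassName: gp3-ssd-encrypted
  accessModes: ["ReadWriteOnce"]
  dataSource:                     # this is the step teams miss mid-recovery
    name: db-backup
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi              # must be at least the snapshot's source size

Worth rehearsing before the recovery exercise, not during it.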

Access modes that matter

Three access modes; a PVC can request any combination, but the backend decides what it can actually honor.

ReadWriteOnce (RWO) is the default for block storage. One node mounts the volume; one pod (or many on the same node) reads and writes. EBS, GCE PD, Azure managed disk, all RWO.

ReadOnlyMany (ROX) is multi-node read. Useful for shared static content, model weights, build artifacts. Most block storage doesn’t support it; NFS, EFS, GCS-FUSE do.

ReadWriteMany (RWX) is multi-node read-write. The most expensive mode and the one that bites people. Block storage doesn’t do it; only file-storage backends (NFS, CephFS, EFS) do. If you need RWX, you’re running file storage, not block.
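If you genuinely need RWX, the manifests look like this. A sketch assuming the AWS EFS CSI driver (efs.csi.aws.com), with a hypothetical filesystem ID:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-shared
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap        # one EFS access point per volume
  fileSystemId: fs-0123abcd       # hypothetical: your EFS filesystem
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets
spec:
  accessModes: ["ReadWriteMany"]  # legal here because the backend is file storage
  storageClassName: efs-shared
  resources:
    requests:
      storage: 5Gi                # EFS is elastic; the field is required but not enforced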

The 80/20. Most workloads need RWO. Anything stateful that has to be sharded between pods (Postgres, Kafka) is per-pod RWO; the application handles multi-pod consistency. RWX is the niche: static assets, shared model files. It should be a deliberate choice, not the default.
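The per-pod-RWO pattern is what StatefulSet volumeClaimTemplates exist for: each replica gets its own claim and its own volume. A minimal sketch, names illustrative:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:           # one RWO PVC per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3-ssd-encrypted
      resources:
        requests:
          storage: 100Gi

Each replica gets data-postgres-0, data-postgres-1, and so on; no volume is ever shared, and consistency is the database’s replication problem, not the storage layer’s.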

Antipatterns

Default StorageClass with no encryption. Many CSI defaults ship without encryption-at-rest. Override the StorageClass; mandate encrypted: true; never have an unencrypted volume in production.
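A sketch of the override, assuming the EBS CSI driver: make the encrypted class the cluster default so nothing slips through.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted-default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # replaces the shipped default
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  # kmsKeyId: <key ARN>           # optional: pin a customer-managed key

Remember to remove the annotation from the previous default; with two defaults, which one wins is version-dependent.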

Cluster-default RWX without a real RWX backend. Some teams set RWX on PVCs because “multiple pods want this”, without a backend that supports it. Depending on the driver, the claim either sticks in Pending or provisions anyway, and the failure surfaces later as a Multi-Attach error the first time a second node tries to mount.

Hardcoded zones. A StorageClass with zone: us-east-1a means every PVC lands in 1a; the cluster loses zone spread for stateful workloads. Use topology-aware provisioning (volumeBindingMode: WaitForFirstConsumer, as sketched above) instead.

No volume expansion enabled. When you need to grow a database, you’ll find out the StorageClass didn’t allow expansion; the only path is restore-to-bigger-volume. Set allowVolumeExpansion: true on every production StorageClass from day one.

What to do this week

Three moves. (1) Inventory your StorageClasses and check for encrypted: true and allowVolumeExpansion: true. The defaults aren’t strict enough; explicit is safer. (2) Run a force-detach drill: intentionally take down a node with attached volumes and watch the recovery time. Most teams find their CSI driver doesn’t auto-recover; configure it. (3) Audit RWX usage. If anything has RWX that doesn’t need it, reduce it to RWO; cost and operational complexity both drop.