PodDisruptionBudget Tuning
PDBs limit voluntary disruption. Set too tight, they block upgrades.
Tight PDB
PDB tuning is the discipline of choosing the right Pod Disruption Budget for each workload. Too tight and operations halt; too loose and the workload is not protected; the right setting balances disruption tolerance with operational needs.
What a too-tight PDB looks like:
- minAvailable: 100% blocks all evictions. A PDB that requires 100% availability (or, equivalently, maxUnavailable: 0) never permits an eviction. Drain operations stall; the PDB becomes a barrier rather than a protection.
- Upgrades stop. Cluster upgrades require draining nodes. Nodes cannot drain while a PDB refuses every eviction, so the upgrade halts.
- Nodes cannot drain. Voluntary disruptions that go through the Eviction API, such as kubectl drain, must be permitted by the PDB. A 100% PDB refuses all of them, so routine cluster operations cannot proceed.
- Recognize the symptom. When a drain stalls, check PDBs first. A PDB that allows zero disruptions is often the cause; loosening it lets operations resume.
- Document the cost. Make sure the team understands what a tight PDB costs: operations are slower and flexibility is reduced. The cost is real even when protection is the goal.
Tight PDBs produce predictable problems. The discipline is recognizing this and adjusting.
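The "recognize the symptom" step can be sketched as a small script. This is a minimal illustration, not a definitive tool: it assumes PDB objects have been exported as JSON (for example with `kubectl get pdb --all-namespaces -o json`) and relies on the `status.disruptionsAllowed` field of each PodDisruptionBudget. The sample data, the workload names, and the `blocking_pdbs` helper are hypothetical.

```python
import json

# Hypothetical sample, shaped like the JSON list that
# `kubectl get pdb --all-namespaces -o json` returns.
SAMPLE = """
{
  "items": [
    {
      "metadata": {"namespace": "shop", "name": "checkout-pdb"},
      "spec": {"minAvailable": "100%"},
      "status": {"currentHealthy": 4, "desiredHealthy": 4,
                 "expectedPods": 4, "disruptionsAllowed": 0}
    },
    {
      "metadata": {"namespace": "shop", "name": "catalog-pdb"},
      "spec": {"minAvailable": "80%"},
      "status": {"currentHealthy": 10, "desiredHealthy": 8,
                 "expectedPods": 10, "disruptionsAllowed": 2}
    }
  ]
}
"""


def blocking_pdbs(pdb_list):
    """Return (namespace, name) for each PDB that currently allows zero evictions."""
    blocked = []
    for item in pdb_list.get("items", []):
        status = item.get("status", {})
        # A missing status is treated as "no evictions allowed".
        if status.get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append((meta["namespace"], meta["name"]))
    return blocked


if __name__ == "__main__":
    for namespace, name in blocking_pdbs(json.loads(SAMPLE)):
        print(f"{namespace}/{name} allows no voluntary evictions")
```

A PDB can also report zero allowed disruptions temporarily, for instance while pods are unhealthy, so a hit from this check is a prompt to look at the spec rather than proof that the PDB is misconfigured.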
Loose PDB
The opposite mistake is too-loose PDBs. They allow all the disruption the team wanted to prevent; the workload is not actually protected.
- minAvailable: 1 with replicas: 10 is minimal protection. Requiring 1 of 10 replicas means 9 can be evicted at once. The "protection" permits nearly complete eviction; the PDB is theater.
- Does not catch double disruption. If two disruptions overlap, for example a node drain and a node failure, the loose PDB still has budget for the drain to continue. The workload is significantly degraded; the protection fails when it is needed.
- The math matters. Calculate how many replicas can be down before service degrades unacceptably. The PDB should encode that number; a loose PDB ignores the math.
- Recognize the symptom. When a workload's availability dips during cluster operations, the PDB is likely too loose. Tightening it makes the protection real.
- Document the math. Record the rationale behind each PDB choice. Future maintainers see the reasoning, and the choice survives team changes.
Loose PDBs produce inadequate protection. The discipline is recognizing this and adjusting.
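The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions: load is modeled as requests per second, every replica is assumed to handle the same load, and the figures (1400 rps peak, 200 rps per replica) are invented for the example. Both helper functions are hypothetical.

```python
def required_replicas(peak_rps: int, rps_per_replica: int) -> int:
    """Smallest replica count that can still serve the peak load (integer ceiling)."""
    return (peak_rps + rps_per_replica - 1) // rps_per_replica


def worst_case_capacity(min_available: int, rps_per_replica: int) -> int:
    """Capacity left if voluntary evictions take the workload down to minAvailable."""
    return min_available * rps_per_replica


if __name__ == "__main__":
    replicas = 10          # assumed deployment size
    peak_rps = 1400        # assumed peak load
    rps_per_replica = 200  # assumed per-replica capacity

    needed = required_replicas(peak_rps, rps_per_replica)
    print(f"replicas needed at peak: {needed}")
    print(f"replicas that can safely be down: {replicas - needed}")

    loose = worst_case_capacity(1, rps_per_replica)
    print(f"capacity with minAvailable: 1 in the worst case: {loose} rps")
```

Under these assumptions, the value from `required_replicas` is a natural candidate for minAvailable, and the same numbers can go straight into the documented rationale.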
Right-sized PDB
The right PDB allows operations while protecting the workload. The setting depends on replica count and acceptable disruption.
- minAvailable: 75 to 90% of replicas. A PDB in this range allows some disruption while protecting the bulk of the workload. With 10 replicas and minAvailable: 80%, 8 must stay available, so 2 can be down at once; the operation proceeds while the workload stays protected. Percentages round up to whole pods, so with few replicas confirm that at least one eviction is still allowed.
- Allows rolling. Rolling operations (upgrades, rotations) work with the PDB. It does not prevent operations; it bounds their impact.
- Prevents catastrophe. Simultaneous voluntary disruption is bounded. Concurrent evictions cannot take availability below the minimum, so the workload's customers see bounded impact. Involuntary disruptions such as node failures are not prevented, but they do count against the budget.
- Per-workload calibration. Different workloads need different settings. Critical 24/7 services need a higher minAvailable; less critical services can tolerate a lower one.
- Test in non-production. Drain nodes in a non-production cluster to see whether the PDB lets the drain progress while protecting the workload, then calibrate.
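Before testing a drain, it can help to check how a percentage-based minAvailable translates into whole pods. The sketch below assumes all replicas are healthy and that percentages round up to the next whole pod, which is how Kubernetes documents the behavior. The function names are hypothetical.

```python
def desired_healthy(replicas: int, min_available) -> int:
    """Pods that must stay available; percentage values round up to a whole pod."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Integer ceiling of replicas * percent / 100.
        return (replicas * percent + 99) // 100
    return int(min_available)


def allowed_disruptions(replicas: int, min_available) -> int:
    """Evictions the budget permits when every replica is healthy."""
    return max(0, replicas - desired_healthy(replicas, min_available))


if __name__ == "__main__":
    for replicas in (2, 3, 5, 10, 20):
        allowed = allowed_disruptions(replicas, "80%")
        print(f"replicas={replicas:2d} minAvailable=80% allowed evictions={allowed}")
```

With 3 replicas, 80% rounds up to 3 required pods and leaves no room for eviction, so small workloads may be better served by an absolute number or by maxUnavailable: 1.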
PDB tuning is one of those Kubernetes operational disciplines that pay off when cluster operations and workload availability both matter. Nova AI Ops integrates with cluster events, surfaces PDB-related operational patterns, and supports the team's tuning over time.