Cluster Graceful Degradation
When the cluster is unhealthy, the most important workloads keep running.
Priority preserved
Cluster graceful degradation is the discipline of designing the cluster so it loses lower-priority workloads first when capacity is tight. Critical workloads survive; non-critical workloads are sacrificed; the cluster degrades gracefully rather than catastrophically.
What priority preservation looks like:
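- Critical workloads get the highest priority class: The most critical workloads use the highest priority, such as the built-in system-cluster-critical class. The scheduler can preempt lower-priority pods to make room for them, and the kubelet takes priority into account when it ranks pods for eviction, so the cluster's most important services are the last to go.
- Survives node pressure: When a node is under pressure (memory, disk, etc.), the kubelet evicts pods. It ranks candidates by whether their usage exceeds their requests, then by priority, then by how far usage exceeds requests. Critical pods whose requests cover their real usage stay; non-critical pods are evicted first.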
- Tiered priority: Multiple priority tiers reflect workload importance. Critical (system-cluster-critical), high (production), medium (staging), and low (dev) all have explicit priorities.
- Document the hierarchy: The team documents the priority hierarchy. New workloads get an appropriate priority; operations teams know which workloads survive which conditions.
- Avoid priority inflation: When everyone wants high priority, the discipline breaks down. The team's policy limits who can use high priority; only genuinely critical workloads get it.
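The tiers above can be sketched as PriorityClass manifests generated from one table, which also gives the team a single place to document the hierarchy. This is a minimal sketch, not a recommended policy: the tier names, the numeric values, and the `top_tier_share` helper are assumptions made for illustration. Only the `scheduling.k8s.io/v1` PriorityClass fields and the one-billion ceiling on user-defined values come from Kubernetes itself.

```python
import json

# Hypothetical tier names and values for illustration; pick your own.
TIERS = {
    "production": 100_000,
    "staging": 10_000,
    "dev": 1_000,
}

# User-defined PriorityClass values may not exceed one billion;
# larger values are reserved for the built-in system classes.
MAX_USER_PRIORITY = 1_000_000_000


def priority_class(name, value, description=""):
    """Build a PriorityClass manifest as a plain dict."""
    if value > MAX_USER_PRIORITY:
        raise ValueError(f"{name}: {value} exceeds the user-defined maximum")
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def top_tier_share(class_names, top_tier="production"):
    """Fraction of workloads that claim the top user-defined tier."""
    if not class_names:
        return 0.0
    return sum(1 for c in class_names if c == top_tier) / len(class_names)


if __name__ == "__main__":
    manifests = [priority_class(n, v, f"{n} tier") for n, v in TIERS.items()]
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Running `top_tier_share` over the priority class names of deployed workloads gives a rough inflation signal: if most workloads claim the top tier, the tiers no longer distinguish anything.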
Priority preservation is the foundation. Without it, eviction ignores how important a workload is; with it, eviction order is intentional.
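To make the eviction order concrete, here is a simplified model of how the kubelet ranks pods under node pressure: pods whose usage exceeds their requests come first, then lower priority, then larger overage. It is a sketch under stated assumptions rather than the kubelet's actual code. The `PodUsage` type, the pod names, and the numbers are hypothetical, and only a single resource (memory, in MiB) is considered.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    request_mib: int
    usage_mib: int


def eviction_order(pods):
    """Return pods in the order a simplified kubelet model would evict them."""

    def key(pod):
        overage = pod.usage_mib - pod.request_mib
        exceeds = overage > 0
        # 1. pods exceeding their requests sort first (False < True)
        # 2. then lower priority first
        # 3. then larger overage first
        return (not exceeds, pod.priority, -overage)

    return sorted(pods, key=key)


if __name__ == "__main__":
    pods = [
        PodUsage("prod-db", 100_000, request_mib=1024, usage_mib=900),
        PodUsage("dev-idle", 1_000, request_mib=256, usage_mib=100),
        PodUsage("prod-api", 100_000, request_mib=512, usage_mib=600),
        PodUsage("dev-batch", 1_000, request_mib=256, usage_mib=512),
    ]
    for pod in eviction_order(pods):
        print(pod.name)
```

In this model `prod-api` is evicted before `dev-idle` despite its higher priority, because it is using more than it requested. Priority only protects critical pods whose requests reflect their real usage.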
Isolation
Isolation extends graceful degradation. Different workload types run on different nodes; one workload's resource pressure does not affect another's.
- Production on dedicated nodes: Critical production workloads run on nodes dedicated to them. Other workloads cannot land on these nodes; the production capacity is protected.
- Batch on separate nodes: Batch workloads run on different nodes from production. Batch's resource bursts do not affect production; production's traffic does not affect batch.
- Tainted nodes: The dedicated nodes are tainted. Only pods with matching tolerations can schedule there, so the taints keep other workloads out. A toleration only permits scheduling; a node selector or node affinity is what keeps the production pods themselves on the dedicated nodes.
- Per-tenant nodes: Multi-tenant clusters might isolate at the tenant level. Each tenant's workloads run on that tenant's own nodes; one tenant's issues do not affect others.
- Cost trade-off: Isolation means more nodes, because pods pack less densely. The cost is higher, but the resilience is real; the trade-off is deliberate.
Isolation is the structural defense. The blast radius of a workload's pressure is bounded.
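The taint-based isolation described above can be checked with a small model of how tolerations match taints. This is a sketch of the documented matching rules, not the scheduler's implementation. The `dedicated=production` taint and the `pool=production` label are hypothetical names chosen for the example.

```python
def tolerates(toleration, taint):
    """True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod_tolerations, node_taints):
    """True if every scheduling-blocking taint on the node is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in blocking)


def selects(node_selector, node_labels):
    """True if the node carries every label the pod's nodeSelector asks for."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


if __name__ == "__main__":
    # Hypothetical dedicated node: tainted to keep others out, labeled to pull production in.
    node_taints = [{"key": "dedicated", "value": "production", "effect": "NoSchedule"}]
    node_labels = {"pool": "production"}

    prod_tolerations = [
        {"key": "dedicated", "operator": "Equal", "value": "production", "effect": "NoSchedule"}
    ]
    print("batch pod admitted:", can_schedule([], node_taints))
    print("production pod admitted:", can_schedule(prod_tolerations, node_taints))
    print("production pod pinned:", selects({"pool": "production"}, node_labels))
```

The two checks are deliberately separate. The taint keeps other workloads off the dedicated nodes, while the node selector keeps production pods from drifting onto shared ones.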
Test
The graceful degradation must be tested. Without testing, the design's effectiveness is theoretical; with testing, it is demonstrated.
- Drain half the nodes: The test drains half of the cluster's nodes. Roughly half the capacity remains; the displaced workloads must reschedule; the priority and isolation rules determine what survives.
- Critical workloads keep running: This is the test's primary question. If critical workloads are left pending or evicted while non-critical pods stay, the priority is wrong. If isolated workloads land on the wrong nodes, the isolation has gaps.
- Verifies the design: The test demonstrates that the design works. The team has confidence; the cluster's resilience is real, not aspirational.
- Quarterly cadence: The test runs quarterly. The discipline keeps the design current as workloads evolve; new patterns are tested before they matter.
- Postmortem the failures: When the test reveals issues, postmortems drive fixes. The discipline produces continuous improvement; the cluster's degradation profile gets better over time.
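The test's primary question can be partly automated by scanning pod state after the drain. The sketch below assumes the pod list was saved with `kubectl get pods --all-namespaces -o json` into a file (the `pods.json` name is hypothetical) and reports every case where a higher-priority pod is Pending while a lower-priority pod is Running. It ignores taints, affinity, and pod sizes, so a reported pair is a prompt to investigate, not proof of a misconfiguration.

```python
import json


def load_pods(path):
    """Read the `items` array from saved `kubectl get pods -o json` output."""
    with open(path) as f:
        return json.load(f)["items"]


def summarize(pod):
    """Reduce a pod object to the fields this check needs."""
    meta = pod["metadata"]
    return {
        "name": f'{meta.get("namespace", "default")}/{meta["name"]}',
        # spec.priority is the integer resolved from the pod's priority class.
        "priority": pod["spec"].get("priority", 0),
        "phase": pod["status"].get("phase", "Unknown"),
    }


def priority_inversions(pods):
    """Return (pending, running) name pairs where the running pod has lower priority."""
    rows = [summarize(p) for p in pods]
    pending = [r for r in rows if r["phase"] == "Pending"]
    running = [r for r in rows if r["phase"] == "Running"]
    return [
        (p["name"], r["name"])
        for p in pending
        for r in running
        if r["priority"] < p["priority"]
    ]


if __name__ == "__main__":
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "pods.json"
    for pending_pod, running_pod in priority_inversions(load_pods(path)):
        print(f"{pending_pod} is Pending while lower-priority {running_pod} is Running")
```

An empty report after the drain is consistent with the priority design working. Pairs that span isolated node pools are expected, since a pending production pod cannot use a batch node.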
Cluster graceful degradation is one of those Kubernetes operational disciplines that pays off when capacity becomes tight. Nova AI Ops integrates with cluster telemetry, surfaces priority and isolation patterns, and produces the visibility that the platform team uses to verify the cluster's resilience.