Cluster DR Readiness
Disaster recovery readiness. The audit.
Backup strategy
Cluster state via Velero or equivalent. Daily backups; 30-day retention typical.
Etcd snapshots independently. Etcd is the cluster's source of truth; protect it independently.
Persistent volume snapshots per the application backup strategy. Cluster backup does not always include PV data.
Cluster rebuild capability
End-to-end rebuild from scratch in under 4 hours. Tested annually. Without testing, rebuild capability is theoretical.
IaC for cluster provisioning. Terraform, eksctl, gcloud. Reproducible and version-controlled.
Bootstrap scripts for foundational services: CNI, ingress, DNS, monitoring. Automated; not click-by-click.
Disaster recovery testing
Annual: build a fresh cluster from scratch. Time it; document the procedure; identify gaps.
Semi-annual: restore a backup to a fresh cluster. Verify data integrity; verify application functionality.
Quarterly: drain a region or node group. Verify failover and capacity behaviour.
Documentation
Runbook for full cluster loss. Step-by-step procedure tested in drills.
Contact list: cloud account access, vendor escalation, internal teams. The first hour of disaster is not the time to figure out who to call.
Update after every drill. Drift accumulates; documentation stays accurate only with deliberate updates.
Organisational readiness
On-call training for DR scenarios. Engineers should know what to do when 'rebuild the cluster' is the answer.
Cross-team coordination: networking, application, security teams all involved. Plan ahead for the multi-team incident.
Annual tabletop exercise: DR scenario walkthrough. Builds shared understanding without the cost of a real drill.