Cluster DR Readiness

An audit of disaster recovery (DR) readiness for a Kubernetes cluster.

Backup strategy

Cluster DR starts with backups: Velero or an equivalent for cluster state, etcd snapshots taken independently because etcd is the cluster’s source of truth, and persistent volume (PV) snapshots under each application’s backup strategy, because a cluster backup does not always include PV data.

Cluster state via Velero. Daily backups with 30-day retention are typical; restores can be scoped to a namespace or to individual resources.
Etcd snapshots. Independent of cluster backup; etcd is the cluster’s source of truth and deserves its own protection.
Persistent volume snapshots. Per-application backup strategy; cluster backup does not always include PV data.
Per-tier retention policy. Cluster, etcd, and PV backups each get a documented retention window, which serves both compliance and cost control.
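The per-tier retention policy above can be written down as data and checked mechanically. The following is a minimal sketch, not a Velero or etcd integration: the `Backup` record, the function names, and the etcd and PV retention values are assumptions for illustration; only the 30-day cluster window and the daily cadence come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Retention window per tier, in days. Only the 30-day cluster figure comes
# from the text; the etcd and PV values are placeholders to be replaced.
RETENTION_DAYS = {"cluster": 30, "etcd": 14, "pv": 90}


@dataclass(frozen=True)
class Backup:
    """Hypothetical inventory record for one backup, whatever tool made it."""
    tier: str
    name: str
    created: datetime


def expired(backups, now):
    """Return the backups older than their tier's retention window."""
    return [
        b for b in backups
        if now - b.created > timedelta(days=RETENTION_DAYS[b.tier])
    ]


def stale_tiers(backups, now, max_age=timedelta(days=1)):
    """Return tiers whose newest backup is missing or older than max_age."""
    newest = {}
    for b in backups:
        if b.tier not in newest or b.created > newest[b.tier]:
            newest[b.tier] = b.created
    return sorted(
        tier for tier in RETENTION_DAYS
        if tier not in newest or now - newest[tier] > max_age
    )
```

`expired` answers the cost question (what can be pruned), while `stale_tiers` answers the readiness question (which tier has no recent backup at all). Keeping the windows in one table makes the policy reviewable alongside the rest of the DR documentation.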

Cluster rebuild capability

Backup without rebuild capability is half the story. The target is an end-to-end rebuild from scratch in under 4 hours, tested annually, with IaC for cluster provisioning so the rebuild is reproducible and bootstrap scripts for foundational services so it is automated rather than click-by-click.

End-to-end rebuild. From scratch in under 4 hours; tested annually; without testing, rebuild capability is theoretical.
IaC for provisioning. Terraform, eksctl, gcloud; reproducible and version-controlled.
Bootstrap scripts. CNI, ingress, DNS, monitoring; automated, not click-by-click.
Per-cluster rebuild runbook. The exact sequence is committed to the runbook, so a stressed engineer at 3am can follow it.
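Timing each rebuild step during a drill shows whether the 4-hour target holds and where the time goes. The sketch below assumes the step durations are recorded by hand or by the bootstrap scripts; the step names and the `summarize_rebuild` helper are hypothetical.

```python
from datetime import timedelta

# Rebuild target from the text: end-to-end in under 4 hours.
REBUILD_BUDGET = timedelta(hours=4)


def summarize_rebuild(step_durations):
    """Summarize a timed rebuild drill.

    step_durations maps a step name to the timedelta it took.
    """
    if not step_durations:
        raise ValueError("no rebuild steps were recorded")
    total = sum(step_durations.values(), timedelta())
    return {
        "total": total,
        "within_budget": total < REBUILD_BUDGET,
        "headroom": REBUILD_BUDGET - total,  # negative if over budget
        "slowest_step": max(step_durations, key=step_durations.get),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    drill = {
        "provision": timedelta(minutes=50),
        "cni": timedelta(minutes=10),
        "ingress": timedelta(minutes=15),
        "dns": timedelta(minutes=10),
        "monitoring": timedelta(minutes=20),
        "restore": timedelta(minutes=95),
    }
    print(summarize_rebuild(drill))
```

The slowest step is usually the best candidate for the next round of automation, and the headroom figure shows how much slack remains before the target is missed.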

Disaster recovery testing

DR is only real if it’s tested. A workable cadence is an annual fresh-cluster build, a semi-annual backup restore to verify data integrity, and a quarterly region or node-group drain to verify failover. The cadence catches drift between intent and reality before the actual disaster.

Annual fresh build. Cluster built from scratch; time it, document the procedure, identify gaps.
Semi-annual restore. Backup restored to a fresh cluster; verify data integrity and application functionality.
Quarterly drain. Region or node group drained; verify failover and capacity behaviour.
Per-test gap log. Each drill produces a documented gap list that feeds continuous improvement of DR readiness.
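The annual, semi-annual, and quarterly cadence above is easy to let slip, so it helps to check it mechanically. This is a minimal sketch that assumes the date of each drill's last run is recorded somewhere; the drill keys, the day counts used to approximate each interval, and the `overdue_drills` helper are illustrative choices.

```python
from datetime import date, timedelta

# Drill intervals from the text, approximated in days.
CADENCE = {
    "fresh_build": timedelta(days=365),     # annual
    "backup_restore": timedelta(days=182),  # semi-annual
    "drain": timedelta(days=91),            # quarterly
}


def overdue_drills(last_run, today):
    """Return overdue drills mapped to days overdue (None if never run).

    last_run maps a drill name to the date it was last completed.
    """
    overdue = {}
    for drill, interval in CADENCE.items():
        last = last_run.get(drill)
        if last is None:
            overdue[drill] = None  # never run, so overdue by definition
        elif today - last > interval:
            overdue[drill] = (today - last - interval).days
    return overdue
```

Run on a schedule, a check like this turns a missed drill into a visible finding rather than something discovered during the actual disaster.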

Documentation

The first hour of a disaster is not the time to figure out who to call. Keep a runbook for full cluster loss with a step-by-step procedure tested in drills, a contact list covering cloud accounts, vendors, and internal teams, and a habit of deliberate updates after every drill, because drift accumulates without them.

Full-cluster-loss runbook. Step-by-step procedure; tested in drills, not just written.
Contact list. Cloud account access, vendor escalation, internal teams; ready before the disaster.
Post-drill updates. Drift accumulates; documentation stays accurate only with deliberate updates.
Per-runbook owner. A named owner is accountable for the runbook’s freshness, which keeps the maintenance discipline from lapsing.
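Two of the documentation rules above, a named owner and an update after every drill, can be checked against a simple runbook inventory. The sketch below assumes such an inventory exists; the `Runbook` record and the `runbook_findings` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Runbook:
    """Hypothetical inventory entry for one DR runbook."""
    name: str
    owner: Optional[str]
    last_updated: date


def runbook_findings(runbooks, last_drill):
    """Return (runbook name, problem) pairs for runbooks that need attention."""
    findings = []
    for rb in runbooks:
        if not rb.owner:
            findings.append((rb.name, "no named owner"))
        if rb.last_updated < last_drill:
            findings.append((rb.name, "not updated since last drill"))
    return findings
```

A runbook updated on the day of the drill counts as current here; tighten the comparison if post-drill updates are expected to land on a later date.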

Organisational readiness

DR is multi-team. It takes on-call training for DR scenarios, cross-team coordination across networking, application, and security teams, and an annual tabletop exercise that builds shared understanding without the cost of a real drill. The readiness lives across the org, not in any single runbook.

On-call DR training. Engineers should know what to do when “rebuild the cluster” is the answer.
Cross-team coordination. Networking, application, security teams all involved; plan ahead for the multi-team incident.
Annual tabletop exercise. DR scenario walkthrough; builds shared understanding without the cost of a real drill.
Per-team DR responsibility. Each team’s DR role is documented, so coordination is clear during the actual incident.