Cluster Backup Strategy

Cluster state needs to be backed up. The strategy has three parts: etcd snapshots, Velero backups, and regular restore tests.

etcd

A cluster backup strategy prepares the team for cluster recovery. Without backups, a corrupted etcd or a destroyed control plane cannot be recovered; with backups, it can. The discipline lies in testing: a backup that has never been restored is an aspiration, not a capability.

What etcd backup provides:

Daily etcd snapshots: The etcd cluster's state is snapshotted daily. Each snapshot captures the cluster's API objects, configurations, and metadata. A daily cadence limits the loss to at most one day of cluster changes, which covers most operational scenarios.
Source of truth for cluster state: etcd is the source of truth for what the cluster knows. Restoring etcd restores the cluster's awareness of pods, services, and configurations, so recovery starts from etcd.
Off-cluster storage: The backups are stored outside the cluster, on storage that is independent of it, so that a cluster failure cannot destroy the backups along with the cluster.
Encryption at rest: The backup contains sensitive data (secrets, certificates, configurations), so the backup storage must encrypt it at rest.
Retention policy: The backups are retained per policy: daily backups for the recent period, weekly backups for older periods, and archival copies for the long term. The retention matches the team's recovery needs.
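The retention tiers above (recent dailies, then weeklies, then long-term archive) can be sketched as a small selection function. This is only an illustration: the function name `snapshots_to_keep` and the tier lengths (7 days, 4 weeks, 365 days) are assumptions chosen for the example, not values prescribed by this strategy.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, today, daily_days=7, weekly_weeks=4,
                      archive_days=365):
    """Return the set of snapshot dates a tiered retention policy would keep.

    Tiers (all lengths are illustrative assumptions):
      - every snapshot younger than `daily_days`
      - the newest snapshot per ISO week for the next `weekly_weeks` weeks
      - the newest snapshot per calendar month up to `archive_days`
    """
    keep = set()
    weeks_seen = set()
    months_seen = set()
    weekly_limit = daily_days + weekly_weeks * 7

    # Walk newest first so the first snapshot seen in a week or month wins.
    for d in sorted(set(snapshot_dates), reverse=True):
        age = (today - d).days
        if age < daily_days:
            # Recent tier. Future-dated snapshots (negative age) are kept
            # too, on the principle of never deleting what looks odd.
            keep.add(d)
        elif age < weekly_limit:
            week = tuple(d.isocalendar())[:2]  # (ISO year, ISO week)
            if week not in weeks_seen:
                weeks_seen.add(week)
                keep.add(d)
        elif age < archive_days:
            month = (d.year, d.month)
            if month not in months_seen:
                months_seen.add(month)
                keep.add(d)
        # Anything older than archive_days falls out of retention.
    return keep


if __name__ == "__main__":
    today = date(2024, 3, 31)
    dailies = [today - timedelta(days=i) for i in range(60)]
    for d in sorted(snapshots_to_keep(dailies, today)):
        print(d.isoformat())
```

A pruning job would delete every snapshot whose date is not in the returned set. Keeping the selection logic free of I/O makes it easy to check against the team's recovery needs before anything is deleted.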

The etcd backup is the foundation. Without it, the cluster's recorded state cannot be reconstructed from scratch.
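A minimal sketch of how the snapshot step itself might be scripted, assuming etcd v3 with TLS and kubeadm-style certificate paths. The function name `snapshot_command`, the destination directory, the endpoint, and the certificate locations are all illustrative assumptions. The sketch only builds the `etcdctl snapshot save` invocation and leaves execution, encryption, and the off-cluster copy to the caller.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_command(dest_dir, now,
                     endpoint="https://127.0.0.1:2379",
                     pki_dir="/etc/kubernetes/pki/etcd"):
    """Build the argv and environment for one timestamped etcd snapshot.

    Returns (argv, env, snapshot_path). Nothing is executed here.
    The default endpoint and certificate paths are assumptions that match
    a typical kubeadm layout; adjust them for the actual cluster.
    """
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    path = PurePosixPath(dest_dir) / f"etcd-{stamp}.db"
    argv = [
        "etcdctl", "snapshot", "save", str(path),
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
    ]
    env = {"ETCDCTL_API": "3"}  # select the v3 API explicitly
    return argv, env, path


if __name__ == "__main__":
    argv, env, path = snapshot_command("/var/backups/etcd",
                                       datetime.now(timezone.utc))
    print(" ".join(argv))
```

From a scheduled daily job, the result could be run with `subprocess.run(argv, env={**os.environ, **env}, check=True)`, after which the snapshot file would be copied to encrypted storage outside the cluster.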

Velero

Velero handles the application-level backup. While etcd captures the API state, Velero captures the workload-level state including persistent volumes.

Resource definitions: Velero captures the cluster's Kubernetes resources. Deployments, services, configmaps, and custom resources are all part of the backup.
PV snapshots: Persistent volumes are snapshotted, so the application data is preserved and the workload's state is recoverable.
Application-level: Velero's perspective is the application. Where etcd backups are infrastructure-focused, Velero is workload-focused; the two layers are complementary.
Selective restore: Velero supports restoring specific namespaces or resources. The team does not need to restore everything; targeted restores fit specific recovery scenarios.
Cross-cluster restore: Velero can restore to a different cluster than the source. Disaster recovery to a backup cluster and migration between clusters both use the same Velero capability.
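The backup and selective-restore capabilities above map onto the `velero backup create` and `velero restore create` commands. The sketch below only builds the command lines; the helper names, backup names, namespaces, and the 720-hour TTL are hypothetical values chosen for illustration.

```python
def velero_backup_command(name, namespaces, ttl_hours=720,
                          snapshot_volumes=True):
    """Build argv for a Velero backup of the given namespaces.

    The backup name, namespaces, and TTL are illustrative assumptions.
    """
    return [
        "velero", "backup", "create", name,
        "--include-namespaces", ",".join(namespaces),
        "--ttl", f"{ttl_hours}h0m0s",
        f"--snapshot-volumes={'true' if snapshot_volumes else 'false'}",
    ]


def velero_restore_command(name, from_backup, namespaces=None,
                           namespace_mappings=None):
    """Build argv for a full or selective Velero restore.

    `namespaces` limits the restore to specific namespaces;
    `namespace_mappings` ({source: target}) restores into renamed ones.
    """
    argv = ["velero", "restore", "create", name,
            "--from-backup", from_backup]
    if namespaces:
        argv += ["--include-namespaces", ",".join(namespaces)]
    if namespace_mappings:
        pairs = ",".join(f"{src}:{dst}"
                         for src, dst in sorted(namespace_mappings.items()))
        argv += ["--namespace-mappings", pairs]
    return argv


if __name__ == "__main__":
    print(" ".join(velero_backup_command("daily-shop", ["shop", "payments"])))
    print(" ".join(velero_restore_command("shop-restore", "daily-shop",
                                          namespaces=["shop"])))
```

For a cross-cluster restore, the same restore command would be run against the target cluster's kubeconfig, assuming Velero there is configured to read the same backup storage location.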

Velero complements etcd backup. Together they provide complete cluster recovery capability.

Test

Backups must be tested. An untested backup is unproven; a restore test gives confidence that recovery actually works when it is needed.

Quarterly restore: Once per quarter, the team performs a full restore. A new cluster is brought up from backups; when the recovery succeeds, the team's capability is verified.
New cluster from backups: The restore creates a new cluster rather than adjusting the existing one. The new cluster has all the workloads, all the configurations, and all the data, so the test exercises a comprehensive recovery.
Untested backups are theatre: A backup that has never been restored is unproven. The first restore during a real disaster may reveal issues, time pressure makes them harder to fix, and the recovery may fail.
Document the procedure: The restore procedure is documented so that new team members can perform it, institutional knowledge is preserved, and the next test goes faster.
Track recovery time: The time from the start of the restore to a fully functional cluster is tracked. As the metric improves across tests, the team's recovery capability strengthens and the RTO becomes credible.
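Recovery-time tracking can be as simple as recording the start and finish of each restore test and comparing the latest result with the RTO target. The sketch below assumes a hypothetical `RestoreDrill` record and a 4-hour target; both are illustrative, not part of the strategy above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    """One restore test: when it started and when the cluster was usable."""
    started: datetime
    fully_functional: datetime

    @property
    def recovery_time(self) -> timedelta:
        return self.fully_functional - self.started


def rto_report(drills, rto_target):
    """Summarise restore tests against an RTO target (a timedelta)."""
    if not drills:
        raise ValueError("no restore drills recorded; the RTO is unproven")
    ordered = sorted(drills, key=lambda d: d.started)
    times = [d.recovery_time for d in ordered]
    return {
        "latest": times[-1],
        "worst": max(times),
        "meets_target": times[-1] <= rto_target,
        # Improving means the latest drill beat the previous one.
        "improving": len(times) >= 2 and times[-1] < times[-2],
    }


if __name__ == "__main__":
    drills = [
        RestoreDrill(datetime(2024, 1, 15, 9, 0), datetime(2024, 1, 15, 15, 0)),
        RestoreDrill(datetime(2024, 4, 15, 9, 0), datetime(2024, 4, 15, 13, 30)),
        RestoreDrill(datetime(2024, 7, 15, 9, 0), datetime(2024, 7, 15, 12, 0)),
    ]
    print(rto_report(drills, rto_target=timedelta(hours=4)))
```

Refusing to report on an empty history is deliberate: with no recorded restore test there is no evidence for any recovery time.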

Cluster backup strategy is one of those operational disciplines that pays off in the rare cases where it matters. Nova AI Ops integrates with backup and recovery tools, surfaces backup health, and produces the per-cluster recovery readiness view that the platform team uses to verify the discipline is working.