The Snapshot Frequency Matrix for Recovery
Snapshot frequency drives RPO. This matrix picks the right cadence for each workload class.
RPO drives the cadence
Recovery Point Objective (RPO) is the maximum acceptable data loss for a service, measured in time. Snapshot frequency bounds it directly: in the worst case a failure lands just before the next snapshot, so daily snapshots mean an RPO of 24 hours, hourly snapshots mean 1 hour, and 15-minute snapshots mean 15 minutes plus reconstruction. Below a 15-minute interval, continuous replication is usually cheaper and tighter.
- RPO definition. Maximum acceptable data loss; per-service, in business time.
- Frequency bounds RPO. Daily snapshots mean 24 hours; hourly means 1 hour; every 15 minutes means 15 minutes plus reconstruction.
- Switch to replication below 15 minutes. Continuous replication is usually cheaper and tighter at that frequency.
- Per-service RPO contract. Make RPO part of the service contract so stakeholders agree on the acceptable loss up front.
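The rule above can be sketched as a small lookup: given a target RPO, pick the longest snapshot interval that still meets it, and fall back to continuous replication below 15 minutes. This is a minimal sketch; the function name `pick_cadence` and the list of standard cadences are illustrative assumptions, not part of any tool.

```python
from datetime import timedelta

# Cadences named in the text, longest first (assumed to be the only options).
STANDARD_CADENCES = [timedelta(days=1), timedelta(hours=1), timedelta(minutes=15)]

# Below this target, the text recommends continuous replication instead.
REPLICATION_THRESHOLD = timedelta(minutes=15)


def pick_cadence(target_rpo: timedelta):
    """Return (mechanism, snapshot_interval) for a target RPO.

    The worst-case data loss of a snapshot schedule equals its interval,
    so we choose the longest interval that does not exceed the target.
    """
    if target_rpo < REPLICATION_THRESHOLD:
        return ("continuous-replication", None)
    for cadence in STANDARD_CADENCES:
        if cadence <= target_rpo:
            return ("snapshot", cadence)
    # Not reachable with the cadences above; kept as a guard if the list changes.
    raise ValueError("no cadence satisfies the target RPO")


print(pick_cadence(timedelta(hours=4)))    # hourly snapshots meet a 4-hour target
print(pick_cadence(timedelta(minutes=5)))  # too tight for snapshots
```

Choosing the longest interval that still meets the target keeps snapshot count, and therefore cost, as low as the contract allows.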
The matrix by criticality
Snapshot strategy maps to criticality tier. Critical (financial transactions, user data): continuous replication plus daily snapshots for archive, RPO measured in seconds. Important (production databases, core configs): hourly snapshots plus continuous replication, RPO 1 hour. Standard (reporting databases, analytics): daily snapshots, RPO 24 hours. Low-criticality (logs, derived data): daily or weekly snapshots, with recovery from source where the data is regenerable.
- Critical. Financial transactions, user data; continuous replication plus daily snapshots; RPO seconds.
- Important. Production databases, core configs; hourly snapshots plus continuous replication; RPO 1 hour.
- Standard. Reporting databases, analytics; daily snapshots; replication optional; RPO 24 hours.
- Low-criticality. Logs, derived data; daily or weekly; often regenerable from source.
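One way to keep the matrix enforceable is to store it as data rather than prose. The sketch below encodes the four tiers and the example workloads from the list above; `TierPolicy`, `MATRIX`, `WORKLOAD_TIER`, and `policy_for` are hypothetical names chosen for illustration, and fields the text does not specify are recorded as "not stated" rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    snapshots: str    # snapshot cadence as stated in the text
    replication: str  # "continuous", "optional", or "not stated"
    rpo: str          # RPO as stated in the text


MATRIX = {
    "critical": TierPolicy("daily (archive)", "continuous", "seconds"),
    "important": TierPolicy("hourly", "continuous", "1 hour"),
    "standard": TierPolicy("daily", "optional", "24 hours"),
    "low-criticality": TierPolicy("daily or weekly", "not stated", "not stated"),
}

# Example workloads taken from the tier descriptions above.
WORKLOAD_TIER = {
    "financial transactions": "critical",
    "user data": "critical",
    "production databases": "important",
    "core configs": "important",
    "reporting databases": "standard",
    "analytics": "standard",
    "logs": "low-criticality",
    "derived data": "low-criticality",
}


def policy_for(workload: str) -> TierPolicy:
    """Look up the snapshot policy for a named workload class."""
    tier = WORKLOAD_TIER.get(workload.strip().lower())
    if tier is None:
        # Fail loudly: an unclassified workload should get a tier, not a default.
        raise KeyError(f"no tier assigned for workload: {workload!r}")
    return MATRIX[tier]


print(policy_for("analytics"))
```

Raising on an unknown workload, instead of defaulting to a tier, forces every new service to be classified explicitly.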
Retention policy per tier
Retention scales with tier. Critical: 30 days hot, 90 days warm, 7 years cold for compliance; the cost is real but justified. Important: 14 days hot, 30 days warm, which covers most recovery scenarios. Standard: 7 days hot, 30 days warm. Low-criticality: 7 days, cheap to keep and cheap to lose.
- Critical: 30 days hot, 90 days warm, 7 years cold. Compliance retention; cost real but justified.
- Important: 14 days hot, 30 days warm. Most recovery scenarios fit within this window.
- Standard: 7 days hot, 30 days warm. Beyond that, data is rarely needed.
- Low-criticality: 7 days. Cheap to keep, cheap to lose; the right floor for the tier.
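The retention windows above can be turned into a lifecycle check that says where a snapshot of a given age belongs. This sketch makes two assumptions the text does not settle: the day counts are treated as cumulative age limits measured from snapshot creation (hot up to day 30, warm up to day 90, and so on), and the low-criticality tier's 7 days are treated as hot storage. Seven years is approximated as 7 × 365 days. The names `RETENTION` and `storage_class` are illustrative.

```python
# Per-tier (storage class, maximum age in days), ordered from newest to oldest.
RETENTION = {
    "critical": [("hot", 30), ("warm", 90), ("cold", 7 * 365)],
    "important": [("hot", 14), ("warm", 30)],
    "standard": [("hot", 7), ("warm", 30)],
    "low-criticality": [("hot", 7)],  # storage class assumed; text says only "7 days"
}


def storage_class(tier: str, age_days: int) -> str:
    """Return the storage class for a snapshot of this age, or "expired"."""
    for name, max_age_days in RETENTION[tier]:
        if age_days <= max_age_days:
            return name
    return "expired"


print(storage_class("critical", 45))
print(storage_class("important", 31))
```

A scheduled job could run this over every snapshot and move or delete those whose class has changed.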
Test recovery, not just snapshots
Snapshot existence is not recoverability. Once a quarter, pick a snapshot, restore it to a clean environment, verify integrity, and time the recovery. Timing matters because RTO compounds with RPO: a 1-hour RPO plus an 8-hour RTO leaves 9 hours of customer impact. Document the procedure, because untested procedures fail under pressure.
- Quarterly restore drill. Pick a snapshot, restore to a clean environment, verify integrity.
- Time the recovery. RTO compounds with RPO; 1-hour RPO plus 8-hour RTO leaves 9 hours of customer impact.
- Document the procedure. Untested procedures fail under pressure; the first restore should not be in production.
- Per-tier drill cadence. Drill critical tiers more often than low-criticality ones, so the procedures with the tightest recovery targets are the best rehearsed.
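The drill steps and the RPO-plus-RTO arithmetic can be sketched as a small harness. The `restore` and `verify` callables are placeholders for whatever a given platform provides; nothing here calls a real backup API, and the function names are assumptions made for illustration.

```python
import time
from datetime import timedelta


def worst_case_impact(rpo: timedelta, rto: timedelta) -> timedelta:
    """Customer impact as described in the text: data-loss window plus recovery time."""
    return rpo + rto


def timed_drill(restore, verify) -> dict:
    """Run a restore drill and measure how long restore plus verification takes.

    restore: callable that restores a snapshot into a clean environment
             and returns a handle to the restored data (hypothetical).
    verify:  callable that checks integrity of that handle and returns a bool.
    """
    start = time.perf_counter()
    restored = restore()
    verified = bool(verify(restored))
    elapsed = timedelta(seconds=time.perf_counter() - start)
    return {"verified": verified, "measured_rto": elapsed}


# The worked figure from the text: 1-hour RPO plus 8-hour RTO.
print(worst_case_impact(timedelta(hours=1), timedelta(hours=8)))  # 9:00:00

# A stand-in drill with trivial restore and verify steps.
result = timed_drill(lambda: {"rows": 3}, lambda data: data["rows"] == 3)
print(result["verified"])
```

Recording `measured_rto` from each drill gives a real number to compare against the RTO in the service contract.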
Cost considerations
Snapshot cost scales with frequency, retention, and data size. A 1 TB database with hourly snapshots and 30-day retention keeps 720 snapshots (24 per day × 30 days), which is real money if each one is stored in full. Incremental snapshots help because most cloud providers store only changed blocks; cross-region replicated snapshots double the cost but provide region-failure protection.
- Cost formula. Frequency × retention × data size; the budget line item.
- Incremental snapshots help. Most providers store only changed blocks; marginal cost per snapshot is small after the first.
- Cross-region replication. Doubles cost but provides region-failure protection; required for compliance in many industries.
- Per-tier cost budget. Document a budget for each tier so snapshot spend is reviewed against it rather than discovered on the bill.
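The cost formula can be worked through for the 1 TB example. This is a rough model under stated assumptions: full snapshots each store the whole dataset, incremental snapshots store one full copy plus a fixed fraction of changed data per later snapshot, and cross-region replication doubles stored volume. The `change_rate` parameter is hypothetical, and real providers bill changed blocks in ways this does not capture. The sketch reports stored gigabytes only, since prices vary by provider.

```python
def snapshot_count(per_day: int, retention_days: int) -> int:
    """Number of snapshots retained at steady state."""
    return per_day * retention_days


def stored_gb(size_gb: float, count: int, change_rate=None, cross_region=False) -> float:
    """Estimate total stored volume in GB.

    change_rate=None models full snapshots (frequency x retention x data size).
    Otherwise, one full copy plus change_rate * size_gb for each later snapshot.
    """
    if count == 0:
        return 0.0
    if change_rate is None:
        total = size_gb * count
    else:
        total = size_gb + size_gb * change_rate * (count - 1)
    # Cross-region replication keeps a second copy of everything.
    return total * (2 if cross_region else 1)


count = snapshot_count(per_day=24, retention_days=30)
print(count)                                   # 720, as in the text
print(stored_gb(1000, count))                  # full snapshots
print(stored_gb(1000, count, change_rate=0.01))  # incremental, assumed 1% change
```

Even with a made-up change rate, the gap between the full and incremental figures shows why the marginal cost per snapshot is small after the first.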