The Deletion Protection Discipline Across Resources
Most accidental deletions could have been prevented. This article covers the protection model and which resources should be protected by default.
Default-protect
Deletion protection is the discipline of preventing accidental destruction of critical resources. Cloud APIs make deletion easy; one wrong terraform apply or one mistyped CLI command can wipe production data. Default-protecting critical resources adds friction that prevents the mistake. The friction is intentional; the cost of the friction is far less than the cost of the mistake.
What default-protect looks like:
- Production databases: RDS instances, DynamoDB tables, and other production databases are created with deletion protection enabled. Deletion requires an explicit step to disable protection first. The two-step process catches accidents.
- S3 buckets with customer data: Buckets containing customer data have MFA delete enabled, versioning enabled, and lifecycle protections that prevent accidental purge. A bucket must be emptied, including every object version, before it can be deleted, so removing a versioned bucket takes several deliberate actions; accidental deletion is structurally hard.
- IAM policies: Critical IAM policies (admin policies, root account policies, organization-level policies) are protected. Modifications require deliberate change processes; deletion requires explicitly detaching the policy first.
- IaC enforces the protection: Terraform, CloudFormation, or similar IaC tools include the protection in the resource definitions. Drift detection catches resources where protection has been disabled. The IaC layer is the source of truth.
- Manual deletion requires an explicit unprotect step: When deletion is intentional, the workflow is: disable protection (recorded), delete the resource (recorded), confirm the deletion. The recorded steps produce an audit trail, and an accidental deletion becomes far less likely because the unprotect step is its own deliberate action.
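The two-step workflow in the list above can be sketched as a small in-memory model. This is an illustration only, not any cloud provider's API: `Resource`, `AuditLog`, `unprotect`, and `delete` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class ProtectedResourceError(Exception):
    """Raised when a delete is attempted while protection is still on."""


@dataclass
class Resource:
    name: str
    deletion_protection: bool = True  # protected by default
    deleted: bool = False


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, actor: str, action: str, resource: str) -> None:
        # Every step, including a denied delete, leaves a timestamped entry.
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "resource": resource,
        })


def unprotect(resource: Resource, actor: str, log: AuditLog) -> None:
    """Step 1: a deliberate, recorded action of its own."""
    resource.deletion_protection = False
    log.record(actor, "unprotect", resource.name)


def delete(resource: Resource, actor: str, log: AuditLog) -> None:
    """Step 2: refuses to run unless step 1 has already happened."""
    if resource.deletion_protection:
        log.record(actor, "delete_denied", resource.name)
        raise ProtectedResourceError(f"{resource.name} is protected")
    resource.deleted = True
    log.record(actor, "delete", resource.name)
```

A stray `delete` call against a protected resource fails and is logged; only an `unprotect` followed by a `delete` removes the resource, and both steps appear in the log.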
Default-protect is the structural defense. It does not depend on operator vigilance; it depends on the configuration that exists by default.
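One way to check that configuration before it is applied is to scan a Terraform plan for changes that would delete or unprotect a critical resource. The sketch below assumes the plan has been exported as JSON (for example with `terraform show -json`) and loaded into a dictionary. The `PROTECTION_ATTRS` mapping and the function name are assumptions made for this example and would need extending for other resource types.

```python
# Assumed mapping of Terraform resource type -> its deletion-protection attribute.
PROTECTION_ATTRS = {
    "aws_db_instance": "deletion_protection",
    "aws_rds_cluster": "deletion_protection",
    "aws_dynamodb_table": "deletion_protection_enabled",
}


def find_risky_changes(plan: dict) -> list[str]:
    """Return one message per planned change that deletes or unprotects a critical resource."""
    findings = []
    for rc in plan.get("resource_changes", []):
        attr = PROTECTION_ATTRS.get(rc.get("type"))
        if attr is None:
            continue  # not a resource type this check covers
        change = rc.get("change", {})
        actions = change.get("actions", [])
        after = change.get("after") or {}  # "after" is null for pure deletes
        if "delete" in actions:
            # Covers plain deletes and replacements (delete + create).
            findings.append(f"{rc['address']}: plan would delete a critical resource")
        elif after.get(attr) is not True:
            findings.append(f"{rc['address']}: {attr} is not enabled")
    return findings
```

Run in CI, a non-empty result could fail the pipeline, so that a plan which removes protection needs a deliberate override rather than slipping through review.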
Engineering escape
Legitimate deletion needs to happen sometimes: migrations, decommissions, environment teardowns. The engineering escape lets the team unprotect when justified, but every use of the escape is logged and reviewed, and the resource is re-protected afterward.
- Engineers can unprotect for migration: Documented procedures exist for unprotecting resources during legitimate work. The procedure includes the justification, the expected duration, and the re-protection plan. The escape exists; it is just not the default.
- Logged: Every unprotect action is logged. CloudTrail records the action; the configuration management system records the change. The audit trail is complete; the unprotect cannot happen invisibly.
- Reviewed: Logged unprotect actions are reviewed periodically. Was the unprotect justified? Was re-protection completed? Were there any actions during the unprotect window that look unusual? The review catches both legitimate gaps and policy violations.
- Re-protected after: Once the work is complete, protection is re-enabled. The unprotect window is bounded; protection is the default state. Resources that remain unprotected after the work is done are the failure mode the review catches.
- The friction is the point: The unprotect step is itself a friction layer. When someone runs the wrong terraform apply by mistake, the unprotect step has not happened, so the deletion fails. The friction protects against the most common mistakes.
The engineering escape balances safety against operational reality. With it, the discipline is sustainable; without it, teams disable protection silently to get work done.
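The periodic review described above can be partly automated. The following is a minimal sketch that assumes unprotect and re-protect events have already been exported into simple records. The field names (`time`, `resource`, `action`, `justification`, `expected_duration`) are hypothetical and do not correspond to the CloudTrail event format.

```python
from datetime import datetime, timedelta


def review_unprotect_events(events: list, now: datetime) -> list:
    """Pair each unprotect with its re-protect and flag what needs follow-up."""
    open_windows = {}  # resource -> unprotect event still awaiting re-protection
    findings = []
    for ev in sorted(events, key=lambda e: e["time"]):
        res = ev["resource"]
        if ev["action"] == "unprotect":
            if not ev.get("justification"):
                findings.append((res, "unprotect without justification"))
            open_windows[res] = ev
        elif ev["action"] == "reprotect" and res in open_windows:
            start = open_windows.pop(res)
            if ev["time"] - start["time"] > start["expected_duration"]:
                findings.append((res, "window exceeded expected duration"))
    # Anything left open was never re-protected.
    for res, start in open_windows.items():
        if now - start["time"] > start["expected_duration"]:
            findings.append((res, "still unprotected past expected duration"))
    return findings
```

The output is a list of `(resource, reason)` pairs for a human reviewer; the sketch deliberately does not re-enable protection itself, since whether a window is still legitimately open is a judgment call.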
Recovery if unprotected
Even with deletion protection, mistakes happen. Recovery mechanisms are the last line of defense: when something is deleted that should not have been, can the team get it back?
- Backup retention: Critical resources have backups with retention periods sized for recovery. Examples include RDS automated backups, S3 versioning with appropriate retention, and DynamoDB point-in-time recovery. The backup is the recovery source.
- Soft-delete with grace period: Some platforms support soft delete: the deletion is logical first, with a grace period during which recovery is possible. Hard deletion happens after the grace period expires. The pattern adds another safety window.
- Test recovery: The recovery procedure is tested. The team performs a disaster-recovery exercise, confirms that it actually produces the recovered resource, and measures how long it takes. Without testing, the recovery is theoretical.
- An untested recovery is theatre: Recovery procedures that have never been tested under real conditions fail when needed. The first attempt at recovery should be a planned exercise, not a real incident. The cost of testing is far less than the cost of failed recovery.
- Document the recovery time: The team documents the realistic recovery time for each protected resource class. RTO discussions use the documented numbers; SLAs reference them. The documentation supports both operational and compliance discussions.
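The soft-delete pattern from the list above can be illustrated with a small in-memory model. This is a sketch of the general idea rather than any platform's implementation; the `SoftDeleteStore` class and its method names are hypothetical.

```python
from datetime import datetime, timedelta


class SoftDeleteStore:
    """Items are moved to a trash area first and only purged after a grace period."""

    def __init__(self, grace_period: timedelta):
        self.grace_period = grace_period
        self.live = {}   # name -> data
        self.trash = {}  # name -> (data, deleted_at)

    def put(self, name, data):
        self.live[name] = data

    def delete(self, name, now: datetime):
        """Logical delete: move the item to trash and remember when."""
        self.trash[name] = (self.live.pop(name), now)

    def restore(self, name, now: datetime):
        """Recover an item if its grace period has not expired."""
        data, deleted_at = self.trash[name]
        if now - deleted_at > self.grace_period:
            raise LookupError(f"{name}: grace period expired")
        del self.trash[name]
        self.live[name] = data

    def purge_expired(self, now: datetime):
        """Hard delete: drop every item whose grace period has passed."""
        expired = [n for n, (_, t) in self.trash.items()
                   if now - t > self.grace_period]
        for n in expired:
            del self.trash[n]
        return expired
```

Passing `now` explicitly keeps the model easy to exercise in a recovery drill: a test can delete an item, advance the clock, and confirm that `restore` works inside the window and fails after it.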
Deletion protection is one of those compounding safety disciplines that pay off in the rare cases where they matter. Nova AI Ops integrates with cloud configuration data, surfaces resources without protection that should have it, and tracks unprotect events for the periodic review that closes the loop.