Failed Deploy Cleanup
Failed deploys leave artifacts. Clean up.
Failed deploys leave artifacts
Failed deploys leave debris. Half-applied Terraform, partial Helm upgrades, orphaned cloud resources; each piece is a future incident waiting to be triggered by an unrelated change. Cleanup discipline keeps the environment from becoming the next outage.
- Half-applied Terraform. Partial state per deploy; Terraform stopped mid-apply leaves resources unmanaged in state.
- Partial Helm upgrades. Stuck-pending state per release; Helm release left between versions confuses subsequent operators.
- Orphaned cloud resources. Leaked ALB target groups, IAM roles, security groups per deploy; cost and security debt accumulate quietly.
- Each artifact is a future incident. Cumulative debris per environment becomes the next outage if not cleaned.
Cleanup checklist
The cleanup checklist runs the same way every time. Per-tool reconciliation command, named runbook, no improvisation under post-failure pressure.
- Terraform.
terraform state listand reconcile per failure; surfaces unmanaged or duplicate resources. - Helm.
helm history releaseand rollback to last good version per failure; restores known state. - Cloud resources. Orphan scan on ELBs, target groups, IAM roles, security groups per failure; cleans the long tail.
- Documented runbook per tool. Named recovery steps per tool catch "we do not know what to do" stalls during post-failure cleanup.
Automate detection
Detection catches what manual cleanup misses. Drift checks, naming conventions tied to deploy ID, and daily reports together surface the orphans before they become incidents.
- Drift detection. Terraform plan or Driftctl scan per cluster catches lingering changes that manual cleanup missed.
- Naming convention with deploy ID. Deploy-tagged resource name; orphans without an active deploy auto-flag in subsequent scans.
- Daily failed-deploy report. Email per day with failed-deploy and cleanup status; open items get assigned and closed.
- Orphan dashboard per team. Visible debt count per team; accountability follows visibility.
Rollback discipline
Rollback discipline prevents improvising under pressure. Define rollback before the deploy ships, practice quarterly in staging, auto-rollback on safe SLO regressions and reserve manual for high-blast-radius changes.
- Define rollback before deploy. Documented rollback procedure per deploy; do not improvise during the failure window.
- Practice quarterly in staging. Staging rollback drill per quarter; the first real rollback cannot be the first attempt.
- Auto-rollback on SLO regression. Low-blast-radius services per auto-rollback; manual for changes where automated reversal carries risk.
- Rollback-tested tag per deploy. Explicit "rollback verified" gate catches untested rollback paths before the deploy ships.
How to install the discipline
Installing the discipline takes three pieces. Required runbook step gates closure, auto-ticket on failure assigns ownership, quarterly audit drives accountability with visible numbers.
- Required runbook step. Cleanup checklist as a required gate per deploy; no closing the incident without completing cleanup.
- Auto-ticket on failure. Auto-created ticket per failure assigned to the deployer; closes only on completed cleanup.
- Quarterly audit. Residual-orphans-by-team report per quarter; visibility drives accountability and produces real cleanup.
- Named cleanup owner per team. Cleanup-discipline owner per team; "everyone-and-no-one" follow-up is the failure mode without an owner.