Config Drift Detection
Drift between repo and runtime.
What config drift is
Config drift is when live infrastructure differs from the IaC source of truth. The git repo says one thing, the cloud says another, and recovery plans built on "we will redeploy from Terraform" silently fail because the live state was never what git described.
- Live differs from IaC. Drift is measured per resource: the gap between what git declares and what the cloud actually runs.
- Standard causes. Manual console changes, partial Terraform applies, third-party automation creating resources, deleted-and-recreated resources outside the workflow.
- Undermines IaC assumptions. If live state is not what git says, recovery plans built on git fail at the moment of recovery.
- Documented drift cost. Record how much drift contributed to each incident; once the cost is visible, investment in detection is easier to justify.
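The per-resource gap described above can be pictured as a diff between two attribute maps. The following is a minimal sketch, not how any particular tool computes drift; the function name `attribute_drift` and the sample security-group attributes are hypothetical and chosen only for illustration.

```python
from typing import Any


def attribute_drift(declared: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (declared_value, live_value)} for every attribute that differs.

    An attribute missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        want, have = declared.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: a security group edited by hand in the console.
declared = {"ingress_cidr": "10.0.0.0/8", "port": 443}
live = {"ingress_cidr": "0.0.0.0/0", "port": 443, "description": "temp fix"}

for attribute, (want, have) in attribute_drift(declared, live).items():
    print(f"{attribute}: git says {want!r}, cloud has {have!r}")
```

An empty result means the resource matches its declaration; anything else is a drift finding for that resource.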
How to detect drift
Detection options span tools. Terraform plan in CI is the cheapest baseline; drift-detection products add coverage for resources Terraform does not manage; cloud-native rules cover the gaps.
- Terraform plan in CI. Run a weekly plan against production; it refreshes state from the live infrastructure and shows where that differs from the config, without applying anything.
- Drift-detection products. driftctl, env0, Spacelift; specialised tools that cover resources Terraform does not own.
- Cloud-native rules. AWS Config, Azure Policy, GCP Config Validator; native integration with the cloud control plane.
- Named owner per detector. Each detector has a maintaining team; stale or noisy detectors degrade the signal.
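As a rough sketch of the plan-in-CI baseline, the snippet below wraps `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when a diff is present. The function names and status labels are hypothetical, and the sketch assumes `terraform init` has already run in the working directory and that credentials are available to the job. A non-empty diff can also mean config that was merged but never applied, so the result still needs a human read.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def interpret_plan_exit(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes are treated as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> tuple[str, str]:
    """Run a non-applying plan in `workdir` and return (status, plan_output).

    Assumption: `terraform init` has already been run and credentials are present.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit(result.returncode), result.stdout
```

A scheduled CI job could call `check_drift("infra/prod")` (a hypothetical path) and forward the plan output to the owning team whenever the status is `"drift"`.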
Drift response
Response is decision-driven. Tier-1 drift earns same-day alerts; every drift event gets classified as intentional, accidental, or hostile; classification drives the action.
- Tier-1 alerts same day. Alert according to each resource's criticality; production-critical drift is investigated the same day.
- Classify the drift. Intentional (someone made a deliberate change), accidental (script or click error), or hostile (unauthorised); each calls for different actions.
- Log every drift event. Keep a timestamped record of each drift; patterns reveal which teams or tools cause the most drift.
- Named action per classification. Intentional drift leads to an IaC update, accidental drift to a revert, hostile drift to the incident path; "drift logged but never acted on" is the failure mode.
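The response rules above can be sketched as a small routing table. This is an illustration under stated assumptions: the type names, the record fields, and the wording of each action are hypothetical, and a real setup would write the record to whatever log or ticketing system the team already uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class DriftClass(Enum):
    INTENTIONAL = "intentional"
    ACCIDENTAL = "accidental"
    HOSTILE = "hostile"


# One named action per classification, mirroring the list above.
ACTIONS = {
    DriftClass.INTENTIONAL: "update IaC to match the live change",
    DriftClass.ACCIDENTAL: "revert live state to match IaC",
    DriftClass.HOSTILE: "open a security incident",
}


@dataclass
class DriftEvent:
    resource: str
    tier: int
    classification: DriftClass
    detected_at: datetime


def respond(event: DriftEvent) -> dict:
    """Build the log record for one drift event, including its named action."""
    return {
        "timestamp": event.detected_at.isoformat(),
        "resource": event.resource,
        "classification": event.classification.value,
        "action": ACTIONS[event.classification],
        "same_day_alert": event.tier == 1,  # only tier-1 drift pages the same day
    }


# Hypothetical event: an accidental console edit on a tier-1 resource.
event = DriftEvent(
    resource="aws_security_group.api",
    tier=1,
    classification=DriftClass.ACCIDENTAL,
    detected_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
print(respond(event))
```

Keeping the action in the same record as the classification makes "logged but never acted on" easy to spot: any record without a closed action is an open item.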
Drift prevention
Prevention is harder than detection. Break-glass console access, read-only IAM by default, and policy enforcement together prevent most drift; you cannot fix the cultural patterns without the technical guardrails.
- Console write via break-glass. Write operations in the console require SSO-authenticated, audit-logged break-glass access; default console access is read-only.
- Read-only IAM by default. Engineers debug via read-only paths; the deployer role is the named exception, with an audit trail.
- OPA or Sentinel policies. Block resource creation outside Terraform at the cluster or cloud-provider level; hard prevention beats hopeful detection.
- Named exception process. Document the break-glass path; without process discipline, the failure mode is that everyone uses break-glass routinely.
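One way to audit the read-only default is to scan IAM policy documents for allowed actions that do not look read-only. The sketch below relies on the standard IAM policy JSON layout (`Statement`, `Effect`, `Action`), but the prefix list is a naming heuristic assumed for illustration, not an AWS rule, and the function name is hypothetical. It ignores `NotAction` and conditions, so treat its output as a prompt for review rather than a verdict.

```python
# Assumption: operations starting with these prefixes are treated as read-only.
READ_ONLY_PREFIXES = ("Get", "List", "Describe")


def write_actions(policy: dict) -> list[str]:
    """Return allowed actions in an IAM policy document that do not look read-only."""
    flagged = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        for action in actions:
            # Actions look like "service:Operation"; a bare "*" grants everything.
            operation = action.split(":", 1)[-1]
            if not operation.startswith(READ_ONLY_PREFIXES):
                flagged.append(action)
    return flagged


# Hypothetical policy attached to an engineer role.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "ec2:Describe*"], "Resource": "*"},
        {"Effect": "Deny", "Action": "iam:*", "Resource": "*"},
        {"Effect": "Allow", "Action": "ec2:*", "Resource": "*"},
    ],
}
print(write_actions(policy))
```

Any role other than the deployer that returns a non-empty list is a candidate for tightening or for moving behind the break-glass path.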
How to deploy drift detection
Deploy in stages. Detection first to build confidence in the signal; enforcement second once detection is proven; track drift count over time as the operational metric.
- Weekly Terraform plan first. Plan against production state and send the diff to Slack; a cheap baseline that catches the most common drift.
- Policy enforcement second. Add prevention only after detection is trusted; premature enforcement on a noisy signal damages confidence.
- Track drift count over time. Aim for a quarterly trend toward zero drift on tier-1 resources; the metric drives the program's investment.
- Quarterly drift retro. Review drift by cause class each quarter; targeted fixes follow patterns rather than one-off complaints.
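The drift-count metric and the quarterly cause review can both come from the same event log. The following is a minimal sketch assuming each logged event carries a date, a tier, and a cause label; the tuple layout, function names, and cause labels are hypothetical.

```python
from collections import Counter
from datetime import date


def quarter(d: date) -> str:
    """Label a date with its calendar quarter, for example 2024-Q1."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def quarterly_summary(events):
    """Summarise drift events given as (date, tier, cause) tuples.

    Returns (tier-1 drift count per quarter, drift count per (quarter, cause)).
    """
    tier1 = Counter()
    causes = Counter()
    for day, tier, cause in events:
        q = quarter(day)
        if tier == 1:
            tier1[q] += 1
        causes[(q, cause)] += 1
    return tier1, causes


# Hypothetical drift log entries.
events = [
    (date(2024, 1, 15), 1, "console"),
    (date(2024, 2, 3), 2, "automation"),
    (date(2024, 3, 30), 1, "partial_apply"),
    (date(2024, 4, 9), 1, "console"),
]

tier1, causes = quarterly_summary(events)
for q in sorted(tier1):
    print(q, "tier-1 drift events:", tier1[q])
```

The tier-1 counter is the trend line to watch; the per-cause counter is the input to the retro, since it shows which cause class to fix first.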