Config Drift Prevention With AWS Config
AWS Config rules detect drift. This post covers the rules that catch the most common configuration regressions, and how to remediate and alert on what they find.
High-leverage rules
Configuration drift is the gradual divergence between what infrastructure-as-code declares and what actually exists in the cloud. Drift happens for legitimate reasons (manual fixes during incidents) and illegitimate ones (untracked changes by developers). Prevention is the discipline that keeps drift bounded; the rules layer is where the discipline lives.
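To make the definition concrete, here is a minimal sketch of drift as a comparison between declared and actual settings. The function name `find_drift` and the setting keys are hypothetical, chosen only for illustration; real tooling compares much richer resource models.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (declared_value, actual_value)} for every setting that differs.

    A setting absent on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want, have = declared.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: what the IaC template declares vs. what exists in the cloud.
declared = {"encrypted": True, "backup_retention_days": 7}
actual = {"encrypted": False, "backup_retention_days": 7, "public": True}

print(find_drift(declared, actual))
# {'encrypted': (True, False), 'public': (None, True)}
```

An empty result means the resource matches its declaration; anything else is drift that a rule can flag.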
What high-leverage rules look like:
- S3 buckets must have encryption enabled: The rule checks every S3 bucket for a server-side encryption configuration. Buckets without one are non-compliant. This is one of the easiest security wins, and drift is caught as soon as the rule evaluates the bucket.
- EBS volumes must be encrypted: Every EBS volume must be encrypted at rest. The rule applies to all volumes and surfaces the non-compliant ones. Drift here typically comes from manual instance launches in accounts or Regions where the encrypted-by-default setting was not enabled.
- RDS must have automated backups: RDS instances must have automated backups configured, with a retention period at least as long as the team's recovery target. Drift in backup configuration is high-impact; a database without backups is one ransomware event from being lost.
- Security groups cannot have 0.0.0.0/0 on SSH or RDP: SSH or RDP open to the world is one of the most dangerous configurations. The rule flags any security group that allows it; Config rules detect rather than block, so the fix still has to be applied. Drift here often comes from "temporary" rules that became permanent.
- IAM password policies enforced: The rule checks the account-level password policy: minimum length, complexity requirements, and rotation period. Drift in password policy is often a regression from a policy update that did not propagate to every account.
The rules catalog is organization-specific, but the high-leverage rules are mostly universal. Encryption, backup, and exposure rules cover most of the value.
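The catalog above can be sketched as a list of AWS managed rule definitions. This is a sketch under assumptions: the rule names are hypothetical, the parameter values (7-day retention, 14-character minimum) are placeholders, and the managed rule identifiers and parameter names should be verified against the current AWS Config managed rules list before use.

```python
import json


def managed_rule(name, identifier, params=None):
    """Build a ConfigRule dict for an AWS managed rule.

    The shape follows what boto3's put_config_rule expects in its
    ConfigRule argument; InputParameters must be a JSON string.
    """
    rule = {
        "ConfigRuleName": name,  # hypothetical names, pick your own
        "Source": {"Owner": "AWS", "SourceIdentifier": identifier},
    }
    if params:
        rule["InputParameters"] = json.dumps(params)
    return rule


RULES = [
    managed_rule("s3-encryption", "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"),
    managed_rule("ebs-encryption", "ENCRYPTED_VOLUMES"),
    # Assumed recovery target: at least 7 days of retention.
    managed_rule("rds-backups", "DB_INSTANCE_BACKUP_ENABLED",
                 {"backupRetentionMinimum": "7"}),
    managed_rule("no-open-ssh", "INCOMING_SSH_DISABLED"),
    managed_rule("no-open-rdp", "RESTRICTED_INCOMING_TRAFFIC",
                 {"blockedPort1": "3389"}),
    managed_rule("iam-password-policy", "IAM_PASSWORD_POLICY",
                 {"MinimumPasswordLength": "14"}),
]

# Deploying is left out so the sketch runs without credentials. It would look like:
#   import boto3
#   config = boto3.client("config")
#   for rule in RULES:
#       config.put_config_rule(ConfigRule=rule)
```

Keeping the catalog as data makes it easy to review in a pull request and to apply identically across accounts.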
Auto-remediation
Detection without action is incomplete. Some rules can be auto-remediated: the system applies the correct configuration without human involvement. Others require human review. The split between auto-remediation and alerting reflects the team's confidence in each rule.
- Some rules support auto-remediation: Examples include adding encryption to an S3 bucket, enabling automated backups on RDS, and removing public access from a bucket. The remediation is mechanical; the action is well-understood; the consequences are bounded.
- Others fire alerts only: Some rules require human judgment before remediation. Closing a security group rule might break a workload; removing an IAM grant might affect a job. These rules surface findings; humans decide.
- Auto-remediate the easy ones: Rules with clear correct answers and bounded consequences are auto-remediated. The team's burden drops; compliance improves; routine drift is corrected without human attention.
- Alert for the rest: Rules with judgment-dependent answers route to alerts. The team reviews, decides, and applies the appropriate remediation. The volume is lower because auto-remediation handles the easy cases.
- Document the choice per rule: For each rule, the team documents whether auto-remediation is enabled and why. The documentation supports future reviews; new rules inherit the decision framework.
Auto-remediation is the multiplier. With it, the rules layer scales; without it, the team is buried in compliance work.
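One way to keep the per-rule decision documented and enforceable is to store it as data next to the rules. The sketch below is illustrative only: `RemediationPolicy`, `POLICIES`, `decide`, and the rule names are all hypothetical, and the rationales are paraphrased from the reasoning above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationPolicy:
    rule: str
    auto_remediate: bool
    rationale: str  # the "why" that future reviews will read


POLICIES = {
    p.rule: p
    for p in [
        RemediationPolicy("s3-encryption", True,
                          "Mechanical fix with bounded consequences."),
        RemediationPolicy("rds-backups", True,
                          "Enabling backups does not change workload behavior."),
        RemediationPolicy("no-open-ssh", False,
                          "Closing the rule might break a workload."),
    ]
}


def decide(rule: str) -> str:
    """Return 'remediate' or 'alert' for a non-compliant finding on `rule`."""
    policy = POLICIES.get(rule)
    if policy is None:
        # A rule with no documented decision defaults to human review.
        return "alert"
    return "remediate" if policy.auto_remediate else "alert"


for name in ("s3-encryption", "no-open-ssh", "new-undocumented-rule"):
    print(name, "->", decide(name))
```

Defaulting unknown rules to `alert` is a deliberate choice in this sketch: a new rule earns auto-remediation only after someone writes down why it is safe. In AWS Config itself, the remediate branch would typically map to a remediation configuration backed by a Systems Manager Automation document.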
Alerting
The alerting strategy determines how the team learns about drift. Aggressive alerting produces fatigue; passive alerting produces complacency. The right strategy is graduated: routine findings to a dashboard, persistent drift to a page.
- Non-compliant resources go to a dashboard: The dashboard is the inventory of current drift. The team reviews it during routine work. New findings are visible, trends are observable, and the attention each finding gets matches its severity.
- Drift older than 7 days pages the owner: Drift that persists is treated as a real issue. After a week, the dashboard finding becomes a page. The owner of the affected resource is notified directly. The escalation prevents drift from accumulating without action.
- Severity tiers: Different rules have different severity. Encryption misconfiguration on a public-facing service might page immediately; minor tag drift might go to the dashboard only. The tier reflects the consequence of the drift.
- Owner attribution: Each resource has an owner (via tag, naming convention, or external mapping). The alert routes to the owner directly, so the team that can fix the issue is the one notified. Without attribution, alerts go to a shared queue and tend to languish.
- Track resolution: Findings are tracked through to resolution. The metric "mean time to drift remediation" measures the team's responsiveness. Improving the metric improves the overall security posture.
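The graduated strategy above can be sketched as a small routing function plus the resolution metric. Everything here is an assumption made for illustration: the `Finding` shape, the three severity labels, the `owner` tag key, and the `shared-queue` fallback are hypothetical, not part of any AWS API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

PAGE_AFTER = timedelta(days=7)  # persistent drift escalates after a week


@dataclass
class Finding:
    resource_id: str
    rule: str
    severity: str  # assumed tiers: "critical", "standard", "low"
    first_seen: datetime
    tags: dict = field(default_factory=dict)
    resolved_at: Optional[datetime] = None


def route(finding: Finding, now: datetime) -> tuple[str, str]:
    """Return (channel, recipient) for an open finding."""
    # Owner attribution via tag; unattributed findings fall back to a shared queue.
    owner = finding.tags.get("owner", "shared-queue")
    if finding.severity == "critical":
        return ("page", owner)  # pages immediately
    if finding.severity != "low" and now - finding.first_seen > PAGE_AFTER:
        return ("page", owner)  # drift older than 7 days
    return ("dashboard", owner)  # routine findings, and low severity always


def mean_time_to_remediation(findings: list[Finding]) -> Optional[timedelta]:
    """Average time from first_seen to resolved_at over resolved findings."""
    durations = [f.resolved_at - f.first_seen for f in findings if f.resolved_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
stale = Finding("sg-123", "no-open-ssh", "standard",
                now - timedelta(days=8), {"owner": "team-a"})
print(route(stale, now))  # ('page', 'team-a')
```

In this sketch low-severity findings never page, matching the "dashboard only" tier; a team that wants them to escalate eventually would add a longer threshold rather than remove the tier.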
Config drift prevention is one of those compounding security disciplines. Nova AI Ops integrates with cloud configuration data and policy engines, surfaces drift trends, attributes drift to owners, and produces the audit-ready report that compliance and engineering both reference.