The Network ACL Drift Agent: Detection + Proposal
ACLs drift from their intended state. The agent diffs declared state against observed state, classifies the drift, and proposes a corrective change for a human to approve.
Declared vs observed
The agent reasons over two sources of truth and the gap between them. Declared lives in your IaC; observed lives in the cloud provider; drift is the diff. Most of the work is filtering noise so the meaningful diffs surface.
- Declared state. Terraform, Pulumi, CloudFormation. The agent treats it as authoritative for what the network ACL should be.
- Observed state. Pulled from the cloud provider API. Whatever is actually live, regardless of how it got there.
- Benign drift. Timestamps, provider-generated identifiers that change without any rule changing, ephemeral metadata. The agent filters this class out before the operator ever sees it.
- Meaningful drift. Rules that affect connectivity or exposure. These are the diffs the agent surfaces.
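A minimal sketch of the diff-and-filter step described above. The rule shape, the field names, and the `BENIGN_FIELDS` set are assumptions made for illustration, not the agent's actual schema; rules are keyed by rule number and direction.

```python
from typing import Any, Dict, List, Optional, Tuple

Rule = Dict[str, Any]

# Hypothetical set of fields treated as benign noise; tune per provider.
BENIGN_FIELDS = {"last_modified", "etag", "request_id"}


def rule_key(rule: Rule) -> Tuple[int, bool]:
    """Identify a rule by its number and direction (assumed rule shape)."""
    return (rule["rule_number"], rule["egress"])


def strip_benign(rule: Rule) -> Rule:
    """Drop fields that change without affecting connectivity or exposure."""
    return {k: v for k, v in rule.items() if k not in BENIGN_FIELDS}


def diff_rules(declared: List[Rule], observed: List[Rule]) -> List[Dict[str, Any]]:
    """Return one entry per rule that differs once benign fields are removed."""
    d = {rule_key(r): strip_benign(r) for r in declared}
    o = {rule_key(r): strip_benign(r) for r in observed}
    drift = []
    for key in sorted(d.keys() | o.keys()):
        want: Optional[Rule] = d.get(key)
        have: Optional[Rule] = o.get(key)
        if want != have:
            drift.append({"key": key, "declared": want, "observed": have})
    return drift
```

A rule whose only difference is a timestamp produces no entry; a rule that exists only on the live side appears with `declared` set to `None`.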
Drift classification
Surfacing drift without classifying it is a paging stream nobody reads. Three classes cover almost every real-world case and route to different owners; a fourth, unknown, catches whatever the agent cannot place.
- Manual change. Someone edited the ACL directly in the console. The drift is real and unauthorised; route to the security team.
- Pending IaC change. The IaC was updated but not yet applied. The drift should disappear on the next apply, so the agent auto-resolves the finding after 6 hours if no one acts on it.
- External system change. An attached service (a service mesh, a managed Kubernetes networking layer) modified the ACL. Expected by design but should be acknowledged once.
- Unknown class. If the agent cannot classify with confidence, it surfaces the event as “unclassified drift” rather than guessing. A confident wrong label is worse than an honest “uncertain”.
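The classification rules above might be sketched as follows. The `Evidence` fields, the actor labels, and the 0.8 confidence threshold are all hypothetical; how the agent actually attributes a change (audit logs, IaC commit history) is not specified here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DriftClass(Enum):
    MANUAL = "manual_change"
    PENDING_IAC = "pending_iac_change"
    EXTERNAL = "external_system_change"
    UNCLASSIFIED = "unclassified_drift"


@dataclass
class Evidence:
    actor_kind: Optional[str]    # assumed labels: "console_user", "service", or None
    unapplied_iac_change: bool   # IaC already declares the observed state
    confidence: float            # 0.0 to 1.0, how sure the attribution is


def classify(ev: Evidence, min_confidence: float = 0.8) -> DriftClass:
    """Pick a class, falling back to UNCLASSIFIED instead of guessing."""
    if ev.confidence < min_confidence:
        return DriftClass.UNCLASSIFIED
    if ev.unapplied_iac_change:
        return DriftClass.PENDING_IAC
    if ev.actor_kind == "service":
        return DriftClass.EXTERNAL
    if ev.actor_kind == "console_user":
        return DriftClass.MANUAL
    return DriftClass.UNCLASSIFIED
```

The low-confidence check runs first on purpose: weak evidence never reaches the branches that would assign an owner.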
Propose, do not apply
Network ACLs have a large blast radius. The agent proposes; humans apply. Skipping the human step is how outages become security incidents.
- Terraform diff. For each meaningful drift, the agent emits a corrective change as a reviewable Terraform diff, not as a console click.
- Human review. A human reviews every diff before anything changes. The agent never applies ACL changes itself.
- Normal pipeline. Accepted changes flow through the team’s standard IaC pipeline. The agent does not bypass code review or plan checks.
- Reject record. Rejected proposals are logged with the operator’s reason. The agent learns which patterns the team accepts and which it declines.
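One way to picture the propose-then-review flow is a proposal object that carries a unified diff and can only leave the pending state through an explicit human decision. This is a sketch under assumptions: the `Proposal` type, the file path, and the status names are invented for illustration, and the diff is produced with Python's standard `difflib` over Terraform source text rather than by Terraform itself.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    drift_id: str
    diff: str
    status: str = "pending"          # pending -> accepted | rejected
    operator: Optional[str] = None
    reason: Optional[str] = None


def propose(drift_id: str, current_tf: str, proposed_tf: str,
            path: str = "network_acl.tf") -> Proposal:
    """Build a reviewable diff; nothing is applied here."""
    lines = difflib.unified_diff(
        current_tf.splitlines(keepends=True),
        proposed_tf.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    )
    return Proposal(drift_id=drift_id, diff="".join(lines))


def reject(p: Proposal, operator: str, reason: str) -> Proposal:
    """Record a rejection; a reason is mandatory so the record is useful later."""
    if not reason.strip():
        raise ValueError("a rejection needs a reason")
    p.status, p.operator, p.reason = "rejected", operator, reason
    return p
```

There is deliberately no `apply` function in the sketch: an accepted diff would be committed and go through the team's normal pipeline.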
Daily scan
A daily cadence balances signal quality against alert fatigue. Older drift is already in someone’s queue; fresher drift is what the agent should surface.
- Daily window. The agent runs once per day and surfaces drift that appeared in the last 24 hours. A tighter cadence multiplies noise without finding more issues.
- Backlog at install. The first run will surface a backlog of historical drift. Plan time to triage on day one; do not be surprised by the volume.
- Steady state. After the backlog clears, daily scans typically surface zero to three drift events. The agent is mostly quiet, which is the success state.
- Out-of-cycle scan. After any large infrastructure change or incident, trigger an immediate scan rather than waiting for the daily slot.
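The 24-hour window can be sketched as a simple filter over drift events. The event shape and the `first_seen` field are assumptions; an out-of-cycle scan would call the same function with the current time.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def fresh_drift(events: List[Dict[str, Any]], now: datetime,
                window: timedelta = timedelta(hours=24)) -> List[Dict[str, Any]]:
    """Keep only events first seen inside the window ending at `now`."""
    cutoff = now - window
    return [e for e in events if cutoff < e["first_seen"] <= now]
```

On the first run there is no earlier scan to compare against, so every historical diff looks new; that is the install-time backlog.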
Audit trail
The audit trail is what makes the agent acceptable to security and compliance. Without it, you have automation; with it, you have controlled automation.
- Per-drift log line. Declared state, observed state, classification, proposed correction. One row, one drift event.
- Per-action log line. Every human action on a proposed correction (accepted, rejected, modified) is logged with timestamp and operator identity.
- SOC 2 alignment. The audit trail supports SOC 2 controls around configuration management. The auditor reads the log directly, which keeps the team out of the audit’s critical path.
- Retention. Keep logs for at least 12 months. ACL forensics in incident reviews routinely reach back further than 90 days, so a shorter window loses evidence.
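As an illustration, the two log line types and the retention check could be written as JSON lines. The field names and the 365-day default are assumptions chosen to mirror the bullets above, not a prescribed schema.

```python
import json
from datetime import datetime, timedelta
from typing import Any

ALLOWED_ACTIONS = {"accepted", "rejected", "modified"}


def drift_log_line(drift_id: str, declared: Any, observed: Any,
                   classification: str, proposed_correction: str) -> str:
    """One row per drift event."""
    return json.dumps({
        "drift_id": drift_id,
        "declared": declared,
        "observed": observed,
        "classification": classification,
        "proposed_correction": proposed_correction,
    }, sort_keys=True)


def action_log_line(drift_id: str, action: str, operator: str, at: datetime) -> str:
    """One row per human action on a proposed correction."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unknown action: " + action)
    return json.dumps({
        "drift_id": drift_id,
        "action": action,
        "operator": operator,
        "timestamp": at.isoformat(),
    }, sort_keys=True)


def past_retention(logged_at: datetime, now: datetime,
                   retention: timedelta = timedelta(days=365)) -> bool:
    """True once a row is older than the retention period and may be purged."""
    return now - logged_at > retention
```

Keeping both line types keyed by the same `drift_id` lets a reviewer join a drift event to every decision made about it.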