What IaC is and why it matters
Infrastructure as Code is the practice of managing infrastructure through versioned, machine-readable definition files instead of manual console clicks or ad hoc shell commands. You describe the servers, networks, clusters, databases, and permissions you want in code, commit that code to version control, and let a tool create or change the real infrastructure to match. The code becomes the single source of truth for what your infrastructure is supposed to be, which makes changes reviewable, repeatable, and reversible in a way that point-and-click provisioning never is.
To see why that matters, look at how infrastructure was built before IaC. An engineer logged into a cloud console, clicked through a dozen wizards, opened a few ports, attached some disks, and stood up a server. It worked. Then a second server was built the same way, except the engineer skipped a step or chose a slightly different value, and now the two machines were subtly different. Multiply that across years and dozens of engineers and you get snowflake servers: every machine unique, none reproducible, and nobody quite sure why the one in production behaves differently from the one in staging.
IaC attacks four specific failures of that hand-managed world. Snowflake servers disappear because every machine is built from the same definition, so they come out identical. Undocumented changes disappear because every change is a commit with an author, a timestamp, and a code review, instead of a console click nobody recorded. Slow, error-prone provisioning becomes fast and consistent because a tool applies the same definition every time rather than a human retracing a wizard from memory. And the bus-factor problem fades, because the knowledge of how the infrastructure is built now lives in the repository rather than in one senior engineer's head. The common thread is that IaC turns infrastructure into something you can review, test, diff, and roll back like any other code.
IaC is one pillar of the broader practice of DevOps automation, which covers the full surface from build and test through deploy and operations. This guide is the focused deep-dive on the infrastructure pillar specifically: how the definitions work, where they break, and how to run them safely. If you want the wider automation picture, start with the DevOps automation guide and the broader DevOps practice, then come back here for the IaC mechanics.
The mental model that matters most. IaC is not just "scripts that build servers." The leap is that your code describes a desired state and a tool is responsible for making reality match it, continuously if you want. Once you internalize that, the rest of this guide (declarative reconciliation, state, drift, immutability) follows from it.
Declarative vs imperative
There are two ways to express infrastructure in code, and the distinction shapes everything downstream.
Imperative IaC describes the steps to reach a result. Create this server, then attach this disk, then open this port, then install this package. It reads like a recipe: an ordered list of commands. Shell scripts are the purest imperative form. The problem is that an imperative script assumes a known starting point. Run it twice and it may try to create a server that already exists and fail, or worse, create a second one. The script knows how to go from nothing to something, but not how to reconcile a half-built environment.
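To make the run-it-twice failure concrete, here is a minimal Python sketch of an imperative script running against an in-memory stand-in for a cloud API. `FakeCloud`, `create_server`, and `imperative_bootstrap` are hypothetical names for illustration, not a real SDK.

```python
class FakeCloud:
    """In-memory stand-in for a cloud API (hypothetical, for illustration)."""

    def __init__(self):
        self.servers = []

    def create_server(self, name):
        # A create call does not mean "make sure this exists";
        # it simply creates another server.
        self.servers.append(name)


def imperative_bootstrap(cloud):
    """Imperative style: a fixed list of steps that assumes an empty start."""
    cloud.create_server("web")


cloud = FakeCloud()
imperative_bootstrap(cloud)
imperative_bootstrap(cloud)  # second run has no notion of what already exists
print(cloud.servers)  # ['web', 'web']
```

The script is correct from an empty environment and wrong from any other starting point, which is the gap declarative tools close.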
Declarative IaC describes the desired end state and lets the tool figure out the steps. Here are the servers, disks, and ports that should exist; make reality match. You never write "create" or "delete," you write "this is what should be true," and the tool computes the difference between what exists and what you declared, then makes only the changes needed to close the gap.
| Property | Imperative | Declarative |
|---|---|---|
| You write | The steps to take | The end state you want |
| Run it twice | May fail or duplicate | Idempotent, converges to same state |
| Knows what exists | No, assumes a start point | Yes, via state and a diff |
| Drift detection | Not built in | Falls out naturally from the diff |
| Typical tools | Shell, early Ansible playbooks | Terraform, OpenTofu, CloudFormation |
| Best for | One-off bootstrap, glue | Managed, long-lived infrastructure |
Declarative tools such as Terraform, OpenTofu, and CloudFormation dominate the IaC market for one reason: idempotency. Because they target a desired state rather than a sequence of steps, you can run them repeatedly and always converge on the same result. Run a declarative apply against an environment that is already correct and it does nothing. Run it against an environment missing one firewall rule and it adds exactly that one rule.
That property comes from the reconciliation loop, which is the heart of declarative infrastructure. The tool reads your desired state from the code, reads the actual state of the world (from its state record and the live cloud APIs), computes the difference, and applies only the changes needed to make the two match. It is the same loop a Kubernetes controller runs continuously and the same loop GitOps automates. Understanding reconciliation is the single most useful thing you can learn about IaC, because state, drift, and GitOps all sit on top of it.
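The loop described above can be sketched in a few lines of Python. This is a simplified model, not how any particular tool is implemented: resources are plain dictionaries, and `plan` and `apply` are illustrative function names.

```python
def plan(desired, actual):
    """Compute the changes needed to turn `actual` into `desired`.

    Both arguments map a resource name to its attributes.
    """
    changes = []
    for name, attrs in desired.items():
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != attrs:
            changes.append(("update", name))
    for name in actual:
        if name not in desired:
            changes.append(("delete", name))
    return changes


def apply(desired, actual):
    """Apply the plan to `actual` in place and return the plan that ran."""
    changes = plan(desired, actual)
    for action, name in changes:
        if action == "delete":
            del actual[name]
        else:  # create and update both set the declared attributes
            actual[name] = dict(desired[name])
    return changes


desired = {"web": {"size": "small"}, "fw-443": {"port": 443}}
actual = {"web": {"size": "small"}}  # one firewall rule is missing

print(apply(desired, actual))  # [('create', 'fw-443')]
print(apply(desired, actual))  # [] (already converged, nothing to do)
```

The second call returning an empty plan is idempotency in miniature: the code never says "create," it only states what should exist.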
State and drift
If the reconciliation loop is the engine of declarative IaC, state is its memory, and drift is what goes wrong when reality slips out from under it.
What IaC state is and why it matters
IaC state is the tool's record of what real infrastructure it currently manages and how that maps to the resources in your code. When Terraform creates a load balancer, it writes an entry into its state file linking the resource block in your code to the actual cloud resource it created, including the resource's real identifier. State is the bridge between "the abstract thing I declared" and "the concrete thing that exists in AWS."
State matters because it is how the tool tells apart situations that look identical from the code alone. Without state, given a resource block in your code, the tool cannot tell whether the resource was never created (so it should create it), already exists but was never recorded (so creating it would duplicate or collide with the real one), or was created and later deleted out of band (so it should recreate it). It also cannot tell that a resource removed from your code is one it owns and should delete. State is what lets the reconciliation loop decide between create, update, and delete instead of guessing.
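One way to see why state is needed is to write the decision down as a function of three facts: whether the resource is in your code, whether it is in state, and whether it exists live. The sketch below is a simplified model; the action names are illustrative labels, not the output of any real tool.

```python
def decide(in_code, in_state, exists_live):
    """Pick an action for one resource from three facts about it."""
    if in_code and not in_state:
        # No record of it. If it is already live, creating blindly would
        # duplicate or collide, so it has to be brought under management.
        return "import" if exists_live else "create"
    if in_code and in_state:
        # Tracked. If the live resource vanished, it was deleted out of band.
        return "update-if-changed" if exists_live else "recreate"
    if in_state and not in_code:
        # Removed from code but still tracked: the tool owns it.
        return "delete" if exists_live else "forget"
    return "ignore"  # not in code, not in state: not ours to manage


print(decide(in_code=True, in_state=False, exists_live=False))  # create
print(decide(in_code=True, in_state=True, exists_live=False))   # recreate
print(decide(in_code=False, in_state=True, exists_live=True))   # delete
```

Remove the `in_state` argument and several of these branches collapse into each other, which is exactly the guessing that state prevents.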
Because state is that load-bearing, treat the state file as production data:
- Store it in a shared, remote backend, not on one engineer's laptop. An S3 bucket, a GCS bucket, or a dedicated state backend, so the whole team and your CI pipeline read the same truth.
- Lock it during writes. Two applies running at once against the same state can corrupt it. State locking (a DynamoDB table or, in recent Terraform releases, a native lock file for the S3 backend; built-in locks on most other backends) serializes writes.
- Version and back it up. A corrupted or accidentally deleted state file can orphan real infrastructure that the tool no longer knows it owns. Versioned storage lets you recover.
- Never hand-edit it unless you know exactly what you are doing. The state file is internal bookkeeping; editing it by hand is how you turn a small mistake into a recreate of your database.
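The locking rule above can be illustrated with a local analogue. This sketch uses an exclusive file create as the lock; real remote backends use their own mechanisms, and `state_lock` and `StateLocked` are hypothetical names for illustration.

```python
import os
import tempfile
from contextlib import contextmanager


class StateLocked(Exception):
    """Raised when another run already holds the state lock."""


@contextmanager
def state_lock(lock_path):
    """Hold an exclusive lock for the duration of a state write."""
    try:
        # O_EXCL makes creation fail if the lock file already exists,
        # so only one caller can acquire the lock at a time.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise StateLocked(f"state is locked: {lock_path}") from None
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


lock_path = os.path.join(tempfile.mkdtemp(), "state.lock")
second_apply_error = None
with state_lock(lock_path):
    try:
        with state_lock(lock_path):  # a second apply arrives mid-write
            pass
    except StateLocked as exc:
        second_apply_error = str(exc)
```

The second apply fails fast instead of writing over the first, and the lock is released once the first run finishes.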
Configuration drift: when reality diverges from code
Configuration drift is what happens when the real infrastructure stops matching the code that is supposed to define it. It is the single most common way an IaC shop gets burned, and it almost always starts with a good intention. Someone opens a security group in the console to debug an outage at 2 a.m. and never reverts it. An autoscaler changes an instance count. A manual hotfix edits a config file that IaC also manages. Each of these makes reality diverge from the declared desired state.
Drift is dangerous for a specific, non-obvious reason: the next apply will silently fight it. Because a declarative tool reconciles reality back to the code, the next time anyone runs apply, the tool sees the console-opened security group as a difference from the declared state and quietly closes it, undoing the fix the on-call engineer relied on. Or the apply hits a manual change it cannot reconcile and fails in a confusing way during an unrelated deploy. Either way, the manual change and the code are now at war, and the apply is the battlefield.
The defenses against drift are straightforward in principle:
- Detect it continuously. Run a plan or a dedicated drift-detection job on a schedule that compares reality to the code and reports differences before they bite. A plan that shows unexpected changes is drift surfacing.
- Correct it deliberately. When drift appears, decide whether to pull reality back to the code (apply) or update the code to match an intentional change (commit). Never let it linger unresolved.
- Reduce the surface that can drift. Lock down console write access so manual changes are hard to make in the first place, and lean on immutable infrastructure (next section) so there is little a live box can drift into.
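A scheduled drift check boils down to comparing declared attributes with live ones and reporting the differences without changing anything. The sketch below assumes both sides are available as plain dictionaries; `detect_drift` and the resource names are made up for illustration.

```python
def detect_drift(declared, live):
    """Return {resource: {attribute: (declared_value, live_value)}}."""
    drift = {}
    for name, want in declared.items():
        have = live.get(name, {})
        diffs = {
            key: (want.get(key), have.get(key))
            for key in sorted(set(want) | set(have))
            if want.get(key) != have.get(key)
        }
        if diffs:
            drift[name] = diffs
    return drift


declared = {"sg-app": {"ingress_ports": [443]}}
# Someone opened port 22 in the console during an incident.
live = {"sg-app": {"ingress_ports": [22, 443]}}

print(detect_drift(declared, live))
# {'sg-app': {'ingress_ports': ([443], [22, 443])}}
```

Reporting rather than fixing is deliberate: it leaves the apply-or-commit decision with a person, as the second bullet above recommends.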
Immutable vs mutable infrastructure
Drift has a structural cure, and it is a philosophy as much as a technique: stop changing running servers at all.
Mutable infrastructure is the traditional model. You provision a server once, then you patch it, update its packages, edit its configs, and tweak it over months and years. The server is long-lived and accumulates change. This is the model that produces snowflakes and drift, because every in-place edit is a chance for reality to diverge from the definition.
Immutable infrastructure flips the model: you never modify a running server in place. To change anything, even a one-line config edit, you build a new machine image with the change baked in, deploy fresh instances from that image, and discard the old ones. The running fleet is replaced, not patched. This is the famous "cattle, not pets" distinction: pets are hand-raised, named, nursed back to health when sick, and irreplaceable; cattle are numbered, interchangeable, and replaced when they have a problem rather than nursed.
| Dimension | Mutable (pets) | Immutable (cattle) |
|---|---|---|
| To make a change | Patch the live server | Build a new image, redeploy |
| Drift risk | High, every edit can diverge | Near zero, nothing changes live |
| Rollback | Reverse the change by hand | Redeploy the previous image |
| Reproducibility | Depends on discipline | Guaranteed by the image |
| Cost of a change | Cheap, edit in place | Full image build + redeploy |
Immutability reduces drift to almost nothing, because nothing changes on a live box, so reality cannot diverge from the image that built it. It also makes rollback trivial: to undo a bad change you redeploy the previous known-good image, which is the same mechanism you use for any deploy. And it removes a whole class of "works on that server but not this one" surprises, because every instance came from the same image.
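The claim that rollback uses the same mechanism as any deploy can be shown with a toy model. `Fleet` and the image names are hypothetical; a real system would replace instances behind a load balancer rather than append to a list.

```python
class Fleet:
    """Tracks which image version the fleet runs (hypothetical model)."""

    def __init__(self):
        self.history = []  # image versions, oldest first

    def deploy(self, image):
        # Instances are replaced from the image, never patched in place.
        self.history.append(image)

    @property
    def current(self):
        return self.history[-1] if self.history else None

    def rollback(self):
        # Rollback is just another deploy, of the previous known-good image.
        if len(self.history) < 2:
            raise RuntimeError("no previous image to roll back to")
        self.deploy(self.history[-2])


fleet = Fleet()
fleet.deploy("app-image-v1")
fleet.deploy("app-image-v2")  # turns out to be a bad release
fleet.rollback()
print(fleet.current)  # app-image-v1
```

There is no separate undo path to maintain: `rollback` calls `deploy`, so the rollback mechanism is exercised every time you ship.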
The tradeoff is real: every change goes through an image build and a redeploy, which is more ceremony than editing a file in place. That cost is exactly why immutable infrastructure depends on a good pipeline (next section) and on golden images: pre-baked, versioned base images that already contain the common packages and hardening, so each new image build starts from a known foundation instead of from scratch. Tools like Packer build golden images; the IaC layer then provisions instances from them.
The IaC toolchain and patterns
The IaC ecosystem is not one tool but three layers that are often confused, a policy layer that checks them, and a set of patterns that turn raw definitions into a maintainable system.
Provisioning vs configuration management vs orchestration
1. Provisioning
Creates and destroys the infrastructure itself: virtual machines, networks, load balancers, clusters, managed databases. This is the existence layer. Tools: Terraform, OpenTofu (the open-source Terraform fork), and Pulumi (which lets you write definitions in general-purpose languages). Cloud-native options include AWS CloudFormation and Bicep for Azure.
2. Configuration management
Takes a machine that already exists and brings its inside into a desired state: installed packages, files, services, users. Tools: Ansible, Chef, Puppet. In an immutable workflow this work shifts into image build time, but the discipline of declaring inside-the-box state still applies.
3. Orchestration
Coordinates many moving pieces over time: rolling out changes across a fleet, sequencing dependent updates, keeping a desired number of replicas running. Kubernetes is the canonical orchestrator, and its controller pattern is the reconciliation loop applied continuously to containers.
4. Policy as code
Encodes the rules infrastructure must obey (no public S3 buckets, every resource tagged, no instance types above a given size) as code that runs against the plan before apply. Tools: Open Policy Agent, Sentinel, Checkov. Policy as code is how you scale IaC review without a human reading every plan.
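As a rough illustration of the idea, the rules can be written as a function that inspects planned resources and returns violations. This is a Python stand-in, not the policy language of any of the tools named above, and the resource shape and `check_plan` name are assumptions made for the example.

```python
def check_plan(planned_resources, required_tags=("owner",)):
    """Return a list of policy violations for the resources in a plan."""
    violations = []
    for res in planned_resources:
        name = res["name"]
        # Rule 1: storage buckets must never be public.
        if res["type"] == "bucket" and res.get("public", False):
            violations.append(f"{name}: bucket must not be public")
        # Rule 2: every resource carries the required tags.
        missing = [t for t in required_tags if t not in res.get("tags", {})]
        if missing:
            violations.append(f"{name}: missing tags {missing}")
    return violations


planned = [
    {"name": "logs", "type": "bucket", "public": True, "tags": {}},
    {"name": "web", "type": "instance", "tags": {"owner": "platform"}},
]

violations = check_plan(planned)
for violation in violations:
    print(violation)
```

A pipeline would fail the merge check whenever the returned list is non-empty, so the rules are enforced on every change rather than on the ones a reviewer happens to read closely.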
The patterns that keep IaC maintainable
Modules. A module is a reusable, parameterized bundle of infrastructure: a "standard service" module might package a load balancer, an autoscaling group, security groups, and logging into one callable unit. Modules are to IaC what functions are to code: they let you define a pattern once and instantiate it consistently. They are also where most of the value of IaC compounds, because a fix to a module propagates everywhere it is used.
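The function analogy can be taken literally. In this sketch a "module" is a Python function that returns the resources for one standard service; the resource shapes, default values, and the name `standard_service` are all illustrative assumptions.

```python
def standard_service(name, min_instances=2, port=443):
    """A 'module': one parameterized definition of a standard service."""
    return {
        f"{name}-lb": {"type": "load_balancer", "port": port},
        f"{name}-asg": {"type": "autoscaling_group", "min": min_instances},
        f"{name}-sg": {"type": "security_group", "ingress_ports": [port]},
        f"{name}-logs": {"type": "log_group", "retention_days": 30},
    }


# Two services instantiated from the same pattern, differing only in inputs.
desired = {}
desired.update(standard_service("checkout"))
desired.update(standard_service("search", min_instances=4))

print(sorted(desired))
```

Changing the log retention inside `standard_service` would change it for both services on the next apply, which is the compounding effect described above.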
GitOps. GitOps is an operating model that makes a Git repository the single source of truth for both application and infrastructure state, with an automated agent that continuously reconciles the running system to match what is committed. Under GitOps you do not run apply from your laptop; you merge a pull request and a controller applies it. That gives you review gates, an audit trail, and automatic drift correction for free, because the reconciler keeps pulling reality back toward the committed state. GitOps is the reconciliation loop, lifted up to the level of your whole repository.
Policy as code ties the patterns together: modules standardize what good infrastructure looks like, GitOps standardizes how it gets applied, and policy as code standardizes the rules every change is checked against before it lands.
IaC in the pipeline
IaC reaches its full value only when it flows through the same delivery pipeline as your application code. That pipeline is the subject of its own deep-dive in our CI/CD guide; here is how IaC specifically moves through it.
On a pull request, the pipeline runs a plan. A plan is the declarative tool's dry run: it computes the difference between the committed code and the live state and prints exactly what it would create, change, or destroy, without touching anything. The plan is posted to the pull request so reviewers can see the consequences of the change before it happens. Automated policy and security checks run against that plan as gates, blocking a merge that would, say, open a database to the internet or remove a required tag.
After the change is approved and merged, the pipeline runs apply, which makes the plan real. In a GitOps setup, a controller does this automatically on merge; in a more traditional setup, the pipeline runs apply with scoped credentials. Either way the principle is the same: humans review the plan, automation runs the apply.
The blast radius of a bad apply. An application deploy that goes wrong usually breaks one service, and you roll forward. An infrastructure apply that goes wrong can delete a database, detach a network, or open a security group across your whole estate, and it runs with broad credentials at machine speed. A plan summary that looks small can still hide a destroy and recreate that takes your data with it: in Terraform, for example, a forced replacement of a database shows up as just "1 to add, 1 to destroy." This is why review of the plan is non-negotiable, why apply credentials should be least-privilege and short-lived, and why staged rollouts (apply to staging, then a canary, then production) matter more for infrastructure than for almost anything else.
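One hedge against a hidden destroy is to have the pipeline flag it mechanically. The sketch below assumes the plan is available in the JSON shape that `terraform show -json` produces, where each entry in `resource_changes` carries an `address` and a `change.actions` list; the sample plan itself is made up.

```python
import json


def destructive_changes(plan):
    """List (address, kind) for planned changes whose actions include a delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions:
            # A delete paired with a create is a replacement, not an update.
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((rc["address"], kind))
    return flagged


# Made-up plan: one database replacement hiding next to a routine update.
plan_json = """
{"resource_changes": [
  {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
  {"address": "aws_security_group.app", "change": {"actions": ["update"]}}
]}
"""

print(destructive_changes(json.loads(plan_json)))
# [('aws_db_instance.main', 'replace')]
```

A gate built on this could require an explicit approval whenever the list is non-empty, so a replacement never rides through on a casual review.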
Testing infrastructure is the part teams skip and regret. The plan itself is the first test: if it shows changes you did not expect, something is wrong. Beyond that, policy-as-code checks validate the plan against your rules, tools like Terratest stand up real infrastructure in a sandbox and assert it behaves, and ephemeral preview environments let a reviewer click through the actual result of a change before it merges. The discipline that makes IaC safe is the same discipline that makes application code safe: review, automated checks, staged rollout, and the ability to roll back fast.
IaC, reliability, and AI
Here is the connection that most IaC guides leave out: most infrastructure incidents trace back to a change, and in an IaC shop that change is almost always an apply or undetected drift.
Think about how an infrastructure incident actually unfolds. A pull request merges. An apply runs. Twenty minutes later latency climbs, error rates spike, or a service starts returning 503s. The on-call engineer is paged. Now they have to answer the hardest question in operations under time pressure: what changed? They scroll through the last ten merges, eyeball ten plans, cross-reference deploy timestamps against the symptom onset, and try to reason backward from a graph to a commit. That backward search is where most of the minutes in MTTR go, and it is exactly the kind of correlation work that drift and bad applies make miserable, because the change that broke things may not be the most recent one, and it may be a drift that no apply even recorded.
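The backward search described above starts with a simple filter: which changes landed shortly before the symptom began? This sketch shows only that first step, with made-up commits and an arbitrary two-hour window; real correlation needs more signals than timestamps, and drift that no apply recorded will not appear in the list at all.

```python
from datetime import datetime, timedelta


def candidate_changes(changes, symptom_start, window=timedelta(hours=2)):
    """Return changes applied within `window` before the symptom, newest first."""
    earliest = symptom_start - window
    hits = [c for c in changes if earliest <= c["applied_at"] <= symptom_start]
    return sorted(hits, key=lambda c: c["applied_at"], reverse=True)


# Made-up apply log for illustration.
changes = [
    {"commit": "a1f3", "applied_at": datetime(2024, 5, 1, 9, 0)},
    {"commit": "b7c2", "applied_at": datetime(2024, 5, 1, 13, 40)},
    {"commit": "c9d4", "applied_at": datetime(2024, 5, 1, 14, 5)},
]
symptom_start = datetime(2024, 5, 1, 14, 25)

suspects = candidate_changes(changes, symptom_start)
print([c["commit"] for c in suspects])  # ['c9d4', 'b7c2']
```

Even this crude filter returns two suspects rather than one, which is the point of the paragraph: the most recent change is not necessarily the guilty one.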
The reliability payoff of IaC is that the answer exists: every change is a reviewable, revertible commit, so in principle you can always find what changed and roll it back. The reliability risk of IaC is that the same automation that applies a good change applies a bad one at full speed and full blast radius. IaC does not remove the risk of change; it makes change faster, which raises the stakes on catching the bad change quickly. That is why reliable IaC is paired with continuous drift detection, plan review, staged applies, and observability that can correlate an infrastructure change to the symptoms it caused, the same way self-healing infrastructure closes the loop from detection to remediation.
This correlation problem, change to symptom across a sprawling multi-cloud estate, is exactly where AI earns its place in the IaC story. Nova AI Ops continuously compares reality to the declared state across AWS, GCP, Azure, Linux, and Windows. It flags drift before the next apply silently reverts a manual fix, and when an incident fires it does the backward search a human would do, instantly: it correlates the symptoms to the specific infrastructure change that caused them, explains the blast radius of that change, and can remediate within a policy envelope you define, for example reverting to the previous known-good state or re-applying the committed definition. The division of labor is the same one that runs through all good automation: the human writes the policy; the agent does the correlation and the safe, bounded fix. Instead of an engineer guessing which of the last ten applies broke production at 3 a.m., the agent points to the change and, inside the bounds you set, fixes it.
A 90-day plan and a 10-point checklist
Adopting IaC is not a big-bang migration. The teams that succeed phase it in over a quarter, proving each layer before adding the next.
Days 1–30: Get one environment under code
Pick a single, non-critical environment and bring it under declarative IaC end to end. Choose your provisioning tool (Terraform or OpenTofu are the safe defaults), set up a remote, locked, versioned state backend on day one, and import or rebuild the environment from code. Do not try to codify your whole estate yet. The goal of this phase is to learn the tool, establish the state discipline, and get the team comfortable reading a plan. By day 30 you should be able to destroy and recreate this environment from code alone.
Days 31–60: Pipeline, modules, and policy
Move apply out of human hands and into a pipeline. Wire plan-on-pull-request so every change shows its diff before merge, add least-privilege apply credentials, and add your first policy-as-code checks (no public buckets, mandatory tags). Refactor the repetitive parts of your definitions into modules so the next environment is a parameterized instance rather than a copy-paste. Connect IaC to your CI/CD pipeline so infrastructure changes ride the same rails as application code. By day 60, no one should be running apply from a laptop.
Days 61–90: Immutability, drift detection, and correlation
Adopt golden images and shift configuration into image build time so your fleet becomes immutable and drift drops toward zero. Turn on continuous drift detection so reality is compared to code on a schedule, not just when someone happens to run a plan. Finally, connect your IaC changes to observability and an agentic layer so that when a change causes an incident, the correlation from symptom to apply is automatic rather than a 3 a.m. manual search. By day 90 you have reproducible environments, reviewed applies, near-zero drift, and a fast path from "something broke" to "this apply did it."
The 10-point IaC checklist
Score your IaC practice against these ten. A mature setup answers yes to all of them.
- Is your state remote, locked, and versioned? State that lives on a laptop or goes unlocked is a corruption or orphaned-resource incident waiting to happen.
- Is everything declarative and idempotent? Can you run apply twice and get the same result, with the second run a no-op?
- Does every infrastructure change go through a reviewed pull request with a plan attached? No console clicks, no laptop applies.
- Do you run automated policy-as-code checks against the plan? Public buckets, missing tags, oversized instances caught before merge, not after an audit.
- Are apply credentials least-privilege and short-lived? A bad apply should not be able to touch resources outside its scope.
- Do you stage applies, staging then canary then production? The blast radius of an infrastructure change is too large to apply everywhere at once.
- Have you reduced mutable surface with immutable infrastructure and golden images? The less that changes on a live box, the less can drift.
- Do you detect drift continuously, not just when someone runs a plan? Drift you find on a schedule is cheap; drift you find during an incident is expensive.
- Is your infrastructure factored into reusable modules? Copy-pasted definitions multiply every future fix; modules apply it once.
- Can you correlate an infrastructure change to the incident it caused, fast? When something breaks after an apply, finding the responsible change should take seconds, not a 3 a.m. archaeology session.
Related guides
IaC is one pillar of a larger automation and reliability stack. Start with the surfaces closest to IaC: DevOps automation (the full automation surface IaC sits inside), CI/CD (the pipeline that plans and applies your IaC), and DevOps (the practice). On reliability foundations: site reliability engineering, AI SRE, Agentic SRE, and self-healing infrastructure. On planning and resilience: capacity planning and chaos engineering. On telemetry and operations: observability, monitoring, MTTR, SLOs and error budgets, and incident management. On the broader operations layer: AIOps, eliminating toil, runbooks, and cloud cost optimization. For teams shipping AI systems: LLMOps. And see the full platform on the features page.
See drift detection and change correlation on your real infrastructure.
Nova AI Ops is the Multi-Agent OS for SRE & DevOps. 100 specialized AI agents across 12 teams, running on AWS, GCP, Azure, Linux, and Windows. It compares reality to your declared state, catches drift, and correlates a bad apply to the incident it caused. Free tier available for small teams.