What is Infrastructure as Code and why does it matter?
Infrastructure as Code (IaC) is the practice of defining and managing cloud infrastructure through machine-readable configuration files rather than through manual processes. This article explains what problem IaC solves, how it solves it, and why Terraform is the tool this series uses to do it.
The problem: infrastructure that can’t be trusted
Manual infrastructure management produces environments that drift. “Drift” means the live state of your systems no longer matches any documented or intended configuration. It accumulates in predictable ways: an engineer opens a firewall port during an incident and doesn’t close it. A virtual machine gets a patch that its peers don’t. A storage account gets its access tier changed through the cloud portal. None of these changes are tracked. None are reviewed. None can be easily undone.
The consequence isn’t just operational noise. Drift directly causes outages and security failures. The 2021 Facebook global outage — which took down Facebook, Instagram, and WhatsApp for roughly six hours — traced back to a routine maintenance command that unintentionally took down the company’s backbone network. The audit tool that should have blocked the command had a bug and let it through; Facebook’s DNS servers then withdrew their BGP routes, making the sites unreachable. The 2012 Knight Capital trading disaster, which cost the firm $440 million in 45 minutes, resulted from a manual deployment in which a technician missed one of eight servers. The inconsistency reactivated decade-old trading code that no one intended to run.
Both failures share a common root: infrastructure whose actual state was unknown, unverifiable, and unrecoverable. Neither had a system that enforced a defined, auditable configuration.
IaC is the structural solution to this problem. When your infrastructure is defined as code, the intended state is explicit and version-controlled. Changes require edits to files, not clicks in a console — which means they can be reviewed, tested, and rejected before they reach production. The live environment can be compared against the defined state at any time. Environments can be destroyed and rebuilt from scratch with known, predictable results.
Two paradigms: imperative vs. declarative
There are two ways to use code to manage infrastructure, and they behave differently enough that conflating them causes real problems.
Imperative code specifies how to do something — a sequence of steps executed in order. A Bash script that calls the Azure CLI to create a resource group, then a storage account, then a virtual network and subnet is imperative. It works the first time. Run it a second time and the outcome depends on each command: some fail because the resource already exists, others silently overwrite settings that have changed since. If you want predictable reruns, you write explicit existence checks and conditionals around every step. As infrastructure grows, imperative scripts grow with it — create scripts, update scripts, delete scripts, each requiring its own logic.
Declarative code specifies what you want — the desired end state — and delegates the “how” to the tool. The following Terraform configuration declares that a resource group named rg-example should exist in eastus:
resource "azurerm_resource_group" "example" {
name = "rg-example"
location = "eastus"
}
Run terraform apply once and it creates the resource group. Run it again and Terraform reports no changes, because the actual state already matches the desired state. This property — producing the same result on every execution regardless of starting conditions — is called idempotency. It’s not a nice-to-have. It’s what makes infrastructure automation reliable at scale.
The tradeoff is real: declarative tools require learning a domain-specific language and managing a state file (explained in the next article). Imperative scripts are immediately familiar to anyone who knows Bash or PowerShell. The reason declarative wins at scale is that the complexity of managing infrastructure state manually — tracking what exists, handling partial failures, writing rollback logic — grows faster than the complexity of learning Terraform’s syntax.
For Azure-only environments, Microsoft’s own Bicep language is a viable declarative alternative. Bicep compiles to ARM templates, stores state in Azure’s control plane (no separate state file), and supports new Azure features on their preview date. It’s lighter-weight for single-cloud use. This series uses Terraform because the organization’s requirements span multiple environments and the tooling should not be Azure-specific. If you know your scope is Azure-only and will remain so, Bicep is worth evaluating before committing to Terraform.
What IaC enables beyond automation
Automation is the visible benefit of IaC. The deeper benefits come from what becomes possible once infrastructure is expressed as text files stored in a repository.
Version control gives you a full, auditable history of every infrastructure change — who made it, when, and what changed. Rolling back means reverting a commit and reapplying. Comparing environments means diffing files.
Code review means infrastructure changes go through the same pull request process as application code. A second pair of eyes on a networking change before it reaches production is the mechanism that catches the kinds of errors that caused the Facebook and Knight Capital incidents.
Reproducibility means an environment is not a special, hand-crafted artifact — it’s an instance of a template. Development, staging, and production are the same configuration with different parameter values. Disaster recovery means running terraform apply in a new region, not reconstructing months of manual work from memory and incomplete documentation.
Drift detection means Terraform can compare the desired state (your configuration files) against the actual state (what’s live in Azure) and show you exactly what has diverged. Manual changes made through the portal become visible rather than invisible.
None of these require Terraform specifically. They require that infrastructure be expressed as code stored in version control. Terraform is the tool; the practice is the point.
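As a concrete illustration of the reproducibility point, here is a minimal sketch of one configuration parameterized across environments. The variable names and the naming convention are illustrative, not this series’ actual conventions:

variable "environment" {
  description = "Short environment name used in resource names, e.g. dev, stage, prod."
  type        = string
}

variable "location" {
  description = "Azure region for this environment."
  type        = string
  default     = "eastus"
}

resource "azurerm_resource_group" "app" {
  name     = "rg-app-${var.environment}"
  location = var.location
}

Running terraform apply with a different -var-file per environment (dev.tfvars, prod.tfvars) produces separate environments from the same template, each tracked in its own state.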
How this series uses IaC: layers and modules
This series treats infrastructure as a set of distinct, independently managed layers. The governance layer (management groups, policies) is separate from the networking layer (hub VNets, firewalls), which is separate from the management layer (logging, monitoring), which is separate from application landing zones. Each layer has its own Terraform configuration and its own state file.
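As a rough sketch of what that separation looks like (state backends are covered properly in the next article), each layer’s configuration points at its own state file. The storage account, container, and key names below are placeholders:

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "stterraformstate001"
    container_name       = "tfstate"
    key                  = "networking.tfstate"  # the governance layer would use its own key, e.g. governance.tfstate
  }
}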
The mechanism that makes this work is the module — a reusable package of Terraform configuration that accepts inputs, provisions resources, and exposes outputs. A module is to infrastructure what a function is to application code: a named, versioned, composable unit that encodes a specific responsibility.
Microsoft’s Azure Verified Modules (AVM) are the modules this series uses throughout. They are Microsoft-maintained, tested, and published to the Terraform Registry. Each module targets a specific scope — management groups and policies, hub networking, log analytics, subscription provisioning — and can be composed with others.
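The shape of a module call is inputs in, resources provisioned, outputs back. The sketch below shows the pattern; the module source, version constraint, and input names are illustrative, so check each AVM module’s page in the Terraform Registry for its actual interface:

module "hub_vnet" {
  source  = "Azure/avm-res-network-virtualnetwork/azurerm"  # illustrative AVM module and version
  version = "~> 0.5"

  name                = "vnet-hub"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  address_space       = ["10.0.0.0/16"]
}

output "hub_vnet_id" {
  value = module.hub_vnet.resource_id  # AVM resource modules expose a resource_id output
}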
The layered approach means each layer is independently updatable. Changing the networking configuration doesn’t touch governance state. Adding a new Log Analytics workspace doesn’t affect hub VNet resources. Application teams can provision their own landing zones without modifying platform infrastructure. This independence is not incidental — it’s a design goal. The modular structure is what makes a complex enterprise deployment maintainable by a team rather than opaque to everyone except the person who built it.
Terraform is not a scripting language
This is worth stating plainly before you write a single line of Terraform, because the scripting mental model produces real errors when applied here.
When you write a Bash script, execution flows from top to bottom. Line 5 runs before line 10. If you define a storage account before a resource group, it fails because the resource group doesn’t exist yet. Order is execution.
Terraform does not work this way. When you run terraform plan or terraform apply, Terraform loads every .tf file in your working directory simultaneously and treats them as a single, unordered configuration. It then constructs a directed acyclic graph (DAG) of all resources and their dependencies. The order resources appear in your files is irrelevant. What determines execution order is the dependency graph.
Dependencies come from two sources. Implicit dependencies are detected automatically. When one resource references an attribute of another — for example, setting resource_group_name = azurerm_resource_group.example.name — Terraform sees that the resource group’s output is needed as input and adds a directed edge in the graph: the resource group must be created before the resource that references it. Explicit dependencies are declared with the depends_on meta-argument for cases where one resource needs another to exist but doesn’t reference its attributes.
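Continuing the resource group example from earlier, the sketch below shows both kinds of dependency. The storage account and virtual network are placeholders, and the depends_on is purely illustrative:

resource "azurerm_storage_account" "example" {
  name                     = "stexampledata001"
  resource_group_name      = azurerm_resource_group.example.name  # implicit dependency: created after the resource group
  location                 = azurerm_resource_group.example.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_virtual_network" "example" {
  name                = "vnet-example"
  resource_group_name = azurerm_resource_group.example.name
  location            = azurerm_resource_group.example.location
  address_space       = ["10.0.0.0/16"]

  depends_on = [azurerm_storage_account.example]  # explicit dependency: ordering without referencing an attribute
}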
Once the graph is built, Terraform walks it in parallel. Resources with no dependencies on each other execute concurrently — by default, up to ten resources at a time. Resources with dependencies execute only after their dependencies complete. For destruction, the graph is reversed.
The practical consequence of this is that the ordering of your plan output is not the ordering of your files. It’s the ordering Terraform determined from the dependency graph. If resources appear to execute in an unexpected order, the answer is not to reorder your files — it’s to check whether the dependency relationship is correctly expressed.
You can inspect Terraform’s dependency graph directly by running terraform graph, which outputs DOT-format data that tools like GraphViz can render visually. This is useful for debugging complex configurations where the dependency chain isn’t obvious.
Terraform’s position in the Azure ecosystem
Terraform is not Azure-native. It’s a multi-cloud infrastructure tool produced by HashiCorp (acquired by IBM in February 2025) that talks to each cloud platform through a plugin called a provider. Two providers are relevant for Azure work.
The AzureRM provider (hashicorp/azurerm, currently version 4.x) is the primary interface for Azure resource management. It covers the large majority of stable, generally available Azure services — virtual networks, storage accounts, compute, databases, Key Vault, AKS, and several hundred more. It has been downloaded approximately 1.4 billion times from the Terraform Registry and is maintained jointly by HashiCorp and Microsoft.
The AzAPI provider (azure/azapi, currently version 2.x) is maintained directly by Microsoft and takes a different approach: it’s a thin wrapper around Azure’s ARM REST APIs, supporting any resource type at any API version with no lag. Where AzureRM requires a provider update to support a new Azure service, AzAPI can manage that service on day zero. Azure Verified Modules use AzAPI for resources where coverage or timing would otherwise be a limitation — management groups, policy assignments, and cross-subscription operations are handled through AzAPI in the modules this series uses.
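As a rough illustration of the AzAPI style, the resource type string pins an explicit ARM API version. The names below are placeholders, and any API version ARM supports could be pinned instead:

data "azurerm_subscription" "current" {}

resource "azapi_resource" "example_rg" {
  type      = "Microsoft.Resources/resourceGroups@2021-04-01"  # resource type and API version pinned explicitly
  name      = "rg-azapi-example"
  parent_id = data.azurerm_subscription.current.id
  location  = "eastus"
}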
The two providers are designed to coexist in the same configuration. In practice, most configurations use AzureRM for the bulk of resources and AzAPI for specific cases. As a beginner, you’ll interact with AzureRM directly and consume AzAPI indirectly through the AVM modules.
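A minimal sketch of a configuration declaring both providers, with version constraints matching the major versions mentioned above, looks like this:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
    azapi = {
      source  = "azure/azapi"
      version = "~> 2.0"
    }
  }
}

provider "azurerm" {
  features {}  # required block, even when empty
}

provider "azapi" {}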
References: HashiCorp Terraform internals documentation (developer.hashicorp.com/terraform/internals/graph); Azure Verified Modules index (azure.github.io/Azure-Verified-Modules); Engineering at Meta outage post-mortem (engineering.fb.com/2021/10/04/networking-traffic/outage); AzureRM provider changelog (registry.terraform.io/providers/hashicorp/azurerm/latest).