Terraform Module Design and State Operations: Building 'IaC That Doesn't Break' with Separation of Concerns, stg/prod State Splitting, and Drift Detection

"I want to codify my infrastructure"—as a requirement it's one line. But the moment you try to put Terraform into production, the things to decide multiply at once. From where do you extract a module? How do you structure a module? How deep do you allow nesting? Do you split state per environment, or are workspaces enough? Who runs apply with which permissions? And—how do you detect and crush the "drift" where code and reality silently diverge?

This article is an implementation guide to designing and operating Terraform in a maintainable form. As source material, I draw on design decisions from two projects: the in-house AI platform I built for a major Japanese broadcaster (GCP, codified entirely in Terraform) and a forestry-DX project (AWS, a product that won a Minister of Economy, Trade and Industry Award).

Ground rules for this article: descriptions of module, state, and backend behavior follow the official HashiCorp documentation (as of June 2026). Block and argument names match the official docs, but Terraform's behavior changes between versions, so always check the latest documentation before going to production. The code is written to be usable in real operations, but credentials and state bucket names are assumed to come from environment variables or backend configuration (never hardcode them).

Cost optimization (FinOps) is a separate article. Tag design, taking stock of wasteful resources, budget guardrails, and other "how to keep costs down with Terraform" topics are split into Optimizing startup costs with Terraform (FinOps). This article focuses on module structure and state operations—that is, "the skeleton of IaC that doesn't break." The two are complementary.

0. Mental model: a good module, and state as "the reality of production"

Before getting into design, let me pin down the two mental models that run through this article. If these are shaky, modules bloat later and state accidents follow.

① A good module = "one reason to change, clear inputs/outputs, reusable"

The official documentation defines a module's purpose as "combining the resource types a provider offers to describe new architectural concepts and raise the level of abstraction." And it warns clearly: "don't make modules that are thin wrappers around single other resource types."

Aspect	Good module	Bad module
Responsibility	One reason to change (SRP)	"Network, DB, and IAM"—everything bundled
Inputs/outputs	Explicit and minimal via `variable` / `output`	Can't tell what to pass to make it work without reading it
Structure	Flat (one level of child modules)	Deep nesting; modules spawning modules
Reuse	Can carry to another env / another project	Fused to this project only
Abstraction	Concept units like "VPC" or "Cloud Run service"	A thin wrapper around one resource (worthless)

② State = a snapshot of "the reality of production"

Terraform's state is a ledger that maps code to actual infrastructure. Treat it as the reality of production itself. That is exactly why you split it per environment, lock it to prevent concurrent-update accidents, and separate concerns so that drift (divergence between code and reality) does not creep in. IaC that treats state carelessly eventually reaches the point where you cannot tell what production looks like even by reading the code.

With these two as axes, we descend into specifics in the order: extraction criteria → standard structure → composition → state isolation → CI gates → drift detection.

1. Module basics: the structure the official docs define

1.1 Root module and child modules

The official definition is clear.

Root module: the .tf files directly under the working directory. The starting point where you run the terraform command.
Child module: a module called from the root via a module block.

A module's source (the source argument) can be a local path, the Terraform Registry, a Git repository, S3, a private registry, and more. A minimal call looks like this.

module "consul" {
  source  = "hashicorp/consul/aws"
  version = "0.1.0"

  servers = 3
}

source is where the module is fetched from, version pins which version, and the rest of the arguments (here servers) are that module's input variables.

1.2 How to write the source (official-compliant)

source has a format determined by where it comes from. Writing this ambiguously breaks reproducibility, so follow the official examples precisely.

# ① Local path (must start with ./ or ../)
module "app_cluster" {
  source = "./app-cluster"
}

# ② Terraform Registry (NAMESPACE/NAME/PROVIDER) with a pinned version
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "6.0.1"
}

# ③ Generic Git (always pin a tag or SHA with ?ref)
module "vpc_git" {
  source = "git::https://example.com/vpc.git?ref=v1.2.0"
}

# ④ Git pinned to a SHA-1 hash (the strictest form)
module "storage" {
  source = "git::https://example.com/storage.git?ref=51d462976d84fdea54b47d80dcabbf680badcdb8"
}

The iron rule here is: always pin the version. The version argument applies only to modules installed from a registry. You cannot attach version to a Git source, so pin it instead with ?ref=v1.2.0 (a tag) or ?ref=<SHA>. The ref value can be anything git checkout accepts (a branch, a SHA-1, or a tag), but never point a production module at a mutable branch such as main. Yesterday's apply and today's apply could pull different code, and reproducibility is gone.

1.3 Inputs, outputs, references

Ideally, a child module behaves like a pure function: it "receives inputs (variable), creates resources, and returns outputs (output)."

# Caller side: pass inputs as arguments
module "ec2_instance" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "6.0.2"

  name          = "example-instance"
  ami           = data.aws_ami.latest_amazon_linux.id
  instance_type = "t2.micro"
}

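# Reference an output from another resource: module.<LABEL>.<output>
# (assumes module "consul" exposes a security_group_id output)
resource "aws_security_group_rule" "example" {
  security_group_id = module.consul.security_group_id
  # type, protocol, from_port, to_port, etc. are omitted here
}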

An output is always referenced with the fixed syntax module.<LABEL>.<output>. From outside a module, you can reach only the values it exposes via output. That is what "clear inputs/outputs" means as a module quality. A design that needs to reach into a module's internal resources from outside is a sign that encapsulation is broken.
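To make the "pure function" shape concrete, here is a minimal sketch of a child module's interface: one typed, validated input, one resource, and one output. The module path modules/app-sg and every name in it are hypothetical, chosen only for illustration.

```hcl
# modules/app-sg/variables.tf (hypothetical module)
variable "vpc_id" {
  type        = string
  description = "ID of the VPC in which to create the security group."

  validation {
    # Reject obviously wrong values at plan time
    condition     = can(regex("^vpc-", var.vpc_id))
    error_message = "vpc_id must look like an AWS VPC ID (vpc-...)."
  }
}

# modules/app-sg/main.tf
resource "aws_security_group" "this" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id
}

# modules/app-sg/outputs.tf
# Only this value is reachable from the caller
output "security_group_id" {
  description = "ID of the security group created by this module."
  value       = aws_security_group.this.id
}
```

A caller passes vpc_id as an argument and reads the result as module.<LABEL>.security_group_id. The aws_security_group resource itself stays internal to the module.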

1.4 Module meta-arguments: `for_each` / `count`

When you want to instantiate the same module multiple times, just like resources, count / for_each are usable on module blocks too.

locals {
  instance_configs = {
    web = { instance_type = "t2.micro" }
    job = { instance_type = "t3.small" }
  }

  # Used by the count example below
  instance_names = ["worker-0", "worker-1"]
}

# for_each: create one module instance per key, each with its own settings
module "ec2_instance" {
  source   = "terraform-aws-modules/ec2-instance/aws"
  version  = "6.0.2"
  for_each = local.instance_configs

  name          = each.key
  instance_type = each.value.instance_type
}

# count: create N instances of the same shape
# (module labels must be unique within a configuration, hence a different label)
module "ec2_worker" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "6.0.2"
  count   = length(local.instance_names)

  name          = local.instance_names[count.index]
  instance_type = "t2.micro"
}

When to use which: if the elements are "a set identifiable by key," use for_each (deleting one element in the middle doesn't cause the others to be recreated due to index shifting); if they're just "N of them," use count. In production, make for_each your first choice. count has the trap that deleting a middle element of a list replaces everything after it.
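The index-shifting trap is easy to reproduce without touching a cloud account. The sketch below uses the built-in terraform_data resource (available in Terraform 1.4 and later, which is an assumption about your version) to create the same three elements once with count and once with for_each. The list contents are made up.

```hcl
locals {
  names = ["web", "job", "batch"]
}

# Addressed by position: by_index[0], by_index[1], by_index[2]
resource "terraform_data" "by_index" {
  count = length(local.names)
  input = local.names[count.index]
}

# Addressed by key: by_key["web"], by_key["job"], by_key["batch"]
resource "terraform_data" "by_key" {
  for_each = toset(local.names)
  input    = each.key
}
```

If you delete "job" from local.names and run plan, by_index[1] changes from "job" to "batch" and by_index[2] is destroyed, while on the for_each side only by_key["job"] is destroyed. On real infrastructure, where such a change often forces replacement, that shift is what recreates resources you never meant to touch.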

Besides these, on module blocks you can also use providers (assigning alternative provider configurations) and depends_on (explicit dependency).
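As a sketch of those two meta-arguments, the fragment below passes an aliased provider configuration into one module and declares an explicit dependency on it from another. The module paths, the osaka alias, and the second region are illustrative assumptions.

```hcl
# Default provider configuration
provider "aws" {
  region = "ap-northeast-1"
}

# Alternative configuration, selected by alias
provider "aws" {
  alias  = "osaka"
  region = "ap-northeast-3"
}

module "network_osaka" {
  source = "./modules/aws-network" # hypothetical local module

  # Inside the module, "aws" now refers to the aliased configuration
  providers = {
    aws = aws.osaka
  }
}

module "app" {
  source = "./modules/app" # hypothetical local module

  # Explicit ordering for a dependency Terraform cannot infer from references
  depends_on = [module.network_osaka]
}
```

depends_on on a module is best kept as a last resort: when data already flows between modules through inputs and outputs, Terraform infers the ordering by itself.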

2. When to extract a module: drawing the line with YAGNI and SRP

This is the point most prone to misjudgment in Terraform design. "Modularize just in case" is textbook technical debt. Let me turn the official warnings ("don't make thin wrappers around single resources," "keep the module tree as flat as possible") into a practical decision table.

2.1 Extract / don't-extract decision table

Situation	Decision	Rationale
The same resource group is repeated in 3+ places	Extract	DRY. Twice is coincidence, three times is a pattern
It represents one concept like "VPC" or "Cloud Run service"	Extract	Raising abstraction = the official module purpose
You want to reproduce the same configuration across environments (stg/prod)	Extract	The core of reuse and reproducibility
It just thinly wraps one `aws_s3_bucket`	Don't extract	Officially discouraged (thin wrapper)
It's only used in one place so far	Don't extract (defer)	YAGNI. Extract once the duplication becomes real
30 `variable`s line up inside, and you can configure anything	Don't extract / split	A sign of "too many responsibilities," not "configurable"
Extracting actually makes the caller harder to read	Don't extract	KISS. If the abstraction's comprehension cost exceeds its value, it's worthless

Applying YAGNI: don't modularize ahead of time because "we might use it in another environment someday." Excessive module splitting breeds the worst kind of debt: deep nesting and over-abstraction. Extracting when the third duplication or a clear concept unit appears is soon enough.

Applying SRP: test whether you can state a module's responsibility in one sentence. "Creates a VPC and subnets" is OK. "Creates a VPC, and creates a DB, and sets up IAM"—when an "and" appears, it has too many responsibilities. It's a candidate for splitting.
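For contrast, this is roughly what the discouraged "thin wrapper" looks like. The module path and names are hypothetical. The point is only that the module adds no concept beyond the single resource it wraps.

```hcl
# modules/s3-bucket/main.tf (hypothetical): a thin wrapper
# One input, one resource, one output. The caller gains nothing over
# writing the resource directly, but now has a module to version and maintain.
variable "bucket_name" {
  type = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

output "bucket_id" {
  value = aws_s3_bucket.this.id
}
```

Writing the aws_s3_bucket resource directly in the caller says the same thing in fewer lines. Extraction starts to pay off when the module bundles several resources into one concept, for example a bucket together with its encryption, versioning, and public-access settings.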

2.2 Standard module structure

Following the standard structure the official docs recommend puts the same thing in the same place for everyone (maintainability).

modules/
  cloud-run-service/
    main.tf          # the resources themselves
    variables.tf     # inputs (variable blocks)
    outputs.tf       # outputs (output blocks)
    versions.tf      # required_version / required_providers
    README.md        # usage and input/output description
    examples/        # invocation examples (optional, but recommended in production)

File splitting isn't enforced by the tool, but the agreement of "resources in main, the entrance in variables, the exit in outputs" dramatically lowers a team's cognitive cost. In my broadcaster project, I codified all of GCP (VPC through Cloud Run, Cloud SQL, Cloud Armor, Secret Manager, Identity Platform, Workflows) with about 71 modules, but precisely because all modules align to this same structure, even at 71 you can read them without getting lost. Unifying the structure pays off more the larger the scale.
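Of the files in that layout, versions.tf is the one most often left out. A minimal sketch follows. The version numbers are placeholders rather than recommendations, so choose constraints that match what you have actually tested.

```hcl
# modules/cloud-run-service/versions.tf
terraform {
  # Lowest Terraform CLI version this module is tested with (placeholder value)
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # Accept any 6.x release (placeholder constraint)
      version = "~> 6.0"
    }
  }
}
```

A reusable module usually declares a minimum or a range here and leaves exact provider pinning to the root module and its dependency lock file (.terraform.lock.hcl).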

3. Composition first: avoid deep nesting

3.1 The official guidance: a flat tree + dependency inversion

The official documentation is clear about composition. "Instead of a deeply nested module tree, use module composition," "keep the module tree as flat as possible (to one level of child modules)."

Its core technique is dependency inversion. In the official words:

"Rather than a module embedding its own dependencies and creating and managing its own copies, the root module passes the dependencies in."

Let me place a bad design (nesting, fusion) and a good design (composition, inversion) side by side.

# ❌ Bad: consul_cluster creates its own network internally
#    → to swap only the network, you end up rewriting the module
module "consul_cluster" {
  source = "./modules/aws-consul-cluster"
  # aws_vpc / aws_subnet are created inside the module... (a hidden dependency)
}

# ✅ Good: build the network in a separate module and pass its IDs in as inputs
module "network" {
  source = "./modules/aws-network"
  # ...
}

module "consul_cluster" {
  source = "./modules/aws-consul-cluster"

  vpc_id     = module.network.vpc_id      # dependency injected from outside
  subnet_ids = module.network.subnet_ids
}

3.2 Why this works

The advantages the official docs cite are software-design principles themselves.

Flexibility (ETC: Easy To Change): even if you swap the dependency's origin from "resource → data source," the consul_cluster side is unchanged—because it depends on the interface (receiving vpc_id).
Reusability: because modules are small and loosely coupled, you can reuse them in different combinations.
Clarity: reading the root module gives you a panoramic view of "which parts connect how."
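The flexibility point can be sketched as follows: the network is no longer created by a module but looked up with data sources, and the consul_cluster call keeps the same two inputs. The Name tag value is an assumption for illustration.

```hcl
# Look up an existing VPC instead of creating one
data "aws_vpc" "shared" {
  tags = {
    Name = "shared-vpc" # hypothetical tag value
  }
}

# All subnets that belong to that VPC
data "aws_subnets" "shared" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }
}

module "consul_cluster" {
  source = "./modules/aws-consul-cluster"

  # Same interface as before; only the origin of the values changed
  vpc_id     = data.aws_vpc.shared.id
  subnet_ids = data.aws_subnets.shared.ids
}
```

Nothing inside modules/aws-consul-cluster has to change, because the module depends only on receiving vpc_id and subnet_ids.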

Deep nesting is the enemy of ETC. A chain of A → B → C → D creates a "variable bucket relay" (a bucket brigade of pass-through arguments): to pass one value to D, you must thread an argument through A, B, and C, and soon nobody can tell which change affects what. A flat, one-level tree composed at the root is the structure that keeps working as scale grows.

The official docs also introduce multi-cloud abstraction (a thin layer that expresses common concepts such as "DNS record" or "Kubernetes cluster" with object-type variables), but they caution that you accept a "lowest common denominator" trade-off. Is unification really worth giving up each vendor's powerful specific features? Reconsider it with YAGNI in mind.
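A provider-neutral interface of that kind might look like the sketch below: an object-typed variable describing DNS record sets, with one possible AWS implementation behind it. The module path and variable names are assumptions, and Route 53 is just one example backend.

```hcl
# modules/dns-records-aws/variables.tf (hypothetical)
variable "zone_id" {
  type        = string
  description = "Route 53 hosted zone ID."
}

variable "recordsets" {
  description = "DNS record sets, described without vendor-specific fields."
  type = list(object({
    name    = string
    type    = string
    ttl     = number
    records = list(string)
  }))
}

# modules/dns-records-aws/main.tf
resource "aws_route53_record" "this" {
  # Key each record set by name and type so addresses stay stable
  for_each = { for rs in var.recordsets : "${rs.name}/${rs.type}" => rs }

  zone_id = var.zone_id
  name    = each.value.name
  type    = each.value.type
  ttl     = each.value.ttl
  records = each.value.records
}
```

Anything Route 53 can do beyond these four fields (alias records, weighted routing, and so on) has no place in the shared type, which is exactly the lowest-common-denominator trade-off.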

4. State operations ①: environment isolation (stg / prod)

This is where production operations really begin. State is the reality of production, so it must be reliably split per environment.

4.1 Remote state + locking

According to the official docs, Terraform stores state locally in terraform.tfstate by default, but in team operations you place it in a remote backend (HCP Terraform / S3 / GCS / Azure Blob, etc.). There are two purposes to going remote.

Sharing: the whole team sees the same reality.
Locking: "prevent simultaneous Terraform runs against the same state." It's the lifeline that prevents state corruption and conflicts from simultaneous apply.

You can write exactly one backend inside the terraform block.

# GCS backend (the GCP setup used in the broadcaster project)
terraform {
  backend "gcs" {
    bucket = "my-org-tfstate-prod"     # bucket that holds the state
    prefix = "platform/infra"          # path inside the bucket
  }
}

# S3 backend with native locking (the AWS setup used in the forestry-DX project)
terraform {
  backend "s3" {
    bucket       = "my-org-tfstate-prod"
    key          = "platform/infra/terraform.tfstate"
    region       = "ap-northeast-1"
    encrypt      = true
    use_lockfile = true   # S3-native locking (no separate DynamoDB lock table needed)
  }
}

A note from the forestry-DX project: previously, the standard was to pair a DynamoDB table for S3 locking, but I manage state with S3 native locking (use_lockfile), preventing simultaneous runs without holding the extra resource of a lock-dedicated table. One fewer resource = one fewer thing to manage, which pays off in a quiet way (cost efficiency, lower operational load).

An important official constraint: a backend block cannot reference "named values" like variables, locals, or data sources. A form like bucket = var.state_bucket is not allowed. When you want to change values per environment, use the directory separation described below, or partial configuration.

# Partial configuration: leave the backend block empty and inject values at init time
terraform init -backend-config="bucket=my-org-tfstate-stg" \
               -backend-config="prefix=platform/infra"
# Or from a file: terraform init -backend-config=stg.backend.hcl
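With partial configuration, the two pieces typically look like this. The file locations are illustrative, and the bucket and prefix values are the ones from the command shown just before. First, the backend block that declares only the backend type:

```hcl
# envs/stg/backend.tf
terraform {
  # Intentionally empty: bucket and prefix are supplied at init time
  backend "gcs" {}
}
```

Then the file passed to terraform init -backend-config=stg.backend.hcl, which contains plain key/value pairs and no blocks:

```hcl
# envs/stg/stg.backend.hcl
bucket = "my-org-tfstate-stg"
prefix = "platform/infra"
```

Keeping one such file per environment makes the difference between stg and prod visible in a two-line diff.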

4.2 Environment isolation: workspaces vs separate state (the most important decision)

On "how to split stg and prod," many teams reach for workspaces and regret it. The official guidance is clear, so let me quote it first.

"Workspaces are not appropriate for system decomposition or deployments requiring separate credentials and access controls."

That is, the official docs themselves say "don't split prod and staging with workspaces." Let me put it in a decision table.

Aspect	Workspaces (`terraform workspace`)	Separate state (directory / backend separation)
Credential separation	Can't (same backend, same auth)	Can (a prod bucket, a prod role)
Isolation of mishaps	Risk of breaking prod by forgetting to `select`	Directories are physically separate = harder to have accidents
Swapping backend config	Not possible (one shared backend)	Possible (change bucket/prefix per environment)
Access control	Hard to split IAM per environment	Can assign a least-privilege role per environment
Suited for	Short-lived parallel work on the same configuration (trying out a feature branch, light verification)	Environment isolation including production (stg/prod)
Official stance	Discouraged for strong isolation	The recommended form for strong isolation

Conclusion: split stg / prod by separate state (directory separation). This is official-compliant, and it was my choice in real projects.

envs/
  stg/
    main.tf            # module "platform" { source = "../../modules/..." }
    backend.tf         # backend "gcs" { bucket = "...-tfstate-stg" }
    terraform.tfvars   # stg-specific values (instance sizes, etc.)
  prod/
    main.tf            # calls the same modules (reproducibility)
    backend.tf         # backend "gcs" { bucket = "...-tfstate-prod" }
    terraform.tfvars   # prod-specific values
modules/               # the configuration definitions shared by stg and prod
  cloud-run-service/
  network/
  ...

The key is this shape: modules/ exists once (DRY: the single source of truth for the configuration), and each envs/<env>/ calls it and supplies per-environment values (reproducibility). You can promote a configuration verified in stg straight to prod, while prod's state, credentials, and permissions stay physically isolated. Use workspaces only where weak isolation is acceptable, such as disposable verification of the same configuration.
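As an illustration of that shape, an environment's root module can be little more than a module call plus its values. The input names of the cloud-run-service module below (name, region, min_instances) are hypothetical.

```hcl
# envs/stg/main.tf
variable "region" {
  type = string
}

variable "min_instances" {
  type = number
}

module "api" {
  # Same module source as prod; only the values differ
  source = "../../modules/cloud-run-service"

  name          = "content-api"
  region        = var.region
  min_instances = var.min_instances
}

# envs/stg/terraform.tfvars would then contain, for example:
#   region        = "asia-northeast1"
#   min_instances = 0
```

envs/prod/main.tf would be the same file, with a different terraform.tfvars and a different backend next to it.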

# One of the few cases for workspaces (when strong isolation is not needed)
terraform workspace new feature-x
terraform workspace select feature-x
terraform workspace list
# Inside the configuration, terraform.workspace returns the current workspace name
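Within a configuration, terraform.workspace is typically used to keep the names of such disposable copies from colliding. A minimal sketch, with a made-up bucket name:

```hcl
locals {
  # The workspace is named "default" until another one is selected
  name_suffix = terraform.workspace == "default" ? "" : "-${terraform.workspace}"
}

resource "google_storage_bucket" "scratch" {
  # e.g. my-org-scratch-feature-x in the feature-x workspace
  name     = "my-org-scratch${local.name_suffix}"
  location = "ASIA-NORTHEAST1"
}
```

All workspaces of one configuration still share the same backend and credentials, which is why this pattern suits throwaway verification and not stg/prod isolation.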

4.3 Passing values between states: terraform_remote_state

When you want to share values across environments or layers (e.g., getting a VPC ID from the network layer's state), the official terraform_remote_state data source lets you reference it read-only.

# Read the outputs of another state (the network layer)
data "terraform_remote_state" "network" {
  backend = "gcs"
  config = {
    bucket = "my-org-tfstate-prod"
    prefix = "platform/network"
  }
}

resource "google_cloud_run_v2_service" "api" {
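  # Reference network information that the other state exposes as an output
  # (note: the network layer must declare output "vpc_connector")
  # name, location, and containers are omitted here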
  template {
    vpc_access {
      connector = data.terraform_remote_state.network.outputs.vpc_connector
    }
  }
}

This makes a loosely-coupled team split hold: the core-platform team exposes "only the information OK to share with other teams" via output, and the consuming side receives it read-only. The caveat is that you can only get the values the referenced side exposed via output. Here too, the encapsulation principle of "what you touch from outside is only output" is at work.
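On the producing side, the network layer only has to declare the output in its root module. The sketch below assumes a Serverless VPC Access connector. The resource arguments are illustrative values, not the project's real ones.

```hcl
# Network layer root module (state stored under prefix "platform/network")
resource "google_vpc_access_connector" "main" {
  name          = "platform-connector" # hypothetical
  region        = "asia-northeast1"
  network       = "platform-vpc" # hypothetical VPC name
  ip_cidr_range = "10.8.0.0/28"
}

# Consumers read this as
# data.terraform_remote_state.network.outputs.vpc_connector
output "vpc_connector" {
  description = "Full resource ID of the Serverless VPC Access connector."
  value       = google_vpc_access_connector.main.id
}
```

Only root-module outputs are visible through terraform_remote_state, and the reader needs read access to the whole state object in the backend, so keep secrets out of a state that other teams consume.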

5. State operations ②: prevent drift with separation of concerns

Drift is the divergence between the state the code declared and the reality of the actual infrastructure. This is the biggest cause of IaC breaking. And the biggest source of drift is "mixed responsibilities."

5.1 Don't mix app deployment with infra changes

The most common anti-pattern is running app deployment (updating container images) and infra configuration changes in the same Terraform. Do this, and:

Every time the app team changes the image tag, Terraform runs,
Unintended infra changes get mixed into that diff,
and as a result you can no longer trace "who changed what, when."

My solution in the broadcaster project was to clearly split responsibilities into two systems.

Responsibility	Tool in charge	What it manages
App delivery	Cloud Build	Building the container image & reflecting the latest env
Infra configuration	Terraform	VPC / Cloud Run definitions / Cloud SQL / Cloud Armor / Secret Manager, etc.

In other words, I separated responsibilities as "container images and the latest env are Cloud Build" and "infra configuration is Terraform." Terraform declares only "the shape of the execution platform" and doesn't get involved with the version of the app running on top. This means the frequently-changing app deployments no longer shake Terraform's state, and the main cause of drift structurally disappears (a form of applying SRP to infra operations).

# Terraform declares "the shape of the execution platform". The tag of the running image is outside Terraform's management.
resource "google_cloud_run_v2_service" "api" {
  name     = "content-api"
  location = var.region

  template {
    containers {
      # ❌ Hardcoding a changing tag such as :v1.2.3 here would make Terraform
      #    detect a diff on every deploy and become a drift source
      image = var.container_image  # later values come from the delivery side; the lifecycle is decoupled
    }
  }

  lifecycle {
    # Exclude attributes updated by the app delivery pipeline from Terraform's management
    ignore_changes = [template[0].containers[0].image]
  }
}

Declaring "the attributes Terraform won't touch" with lifecycle { ignore_changes } is a practice that guarantees separation of concerns at the code level. Without this, every time Cloud Build updates the image, Terraform tries to "put it back" and a diff appears forever.

5.2 Pin module versions (reproducibility)

The version / ?ref pinning described in section 1.2 is also a drift countermeasure. If a module reference points at the main branch, then the moment the upstream module is updated, an unexpected diff surfaces in your plan. That is a kind of drift too. Always pin with version for the Registry and ?ref=<tag/SHA> for Git, and make version bumps deliberate PRs.

6. CI gates: plan(tfsec) → apply(permission-boundary role) → drift detection

Once you've split state and responsibilities, the last step is to move to an operating model where humans don't apply from their laptops. Local apply is a hole in permissions, auditing, and reproducibility all at once. In my forestry-DX project (AWS), I automated Terraform with three gates: plan (with tfsec, on PR), apply (using a role with a permission boundary), and drift detection (a scheduled cron job that files an Issue).

6.1 plan + tfsec on PR (the static-check gate)

The moment a PR is raised, visualize the plan diff and run a security static check with tfsec. The iron rule is to change nothing here (read-only).

# .github/workflows/terraform-plan.yml (on PR: read-only)
name: terraform-plan
on:
  pull_request:
    paths: ["envs/**", "modules/**"]

permissions:
  contents: read
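  id-token: write        # assume the role keylessly via OIDC (see the OIDC article referenced below)
  pull-requests: write   # needed to post the plan result as a PR comment (that step is not shown here)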

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, read-only role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/tf-plan-readonly
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3

      - run: terraform -chdir=envs/prod init
      - run: terraform -chdir=envs/prod plan -no-color -lock-timeout=60s

      # Static security check (blocks public buckets, missing encryption, etc. in CI)
      - name: tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: .

The point is to keep the plan role read-only. There's no reason to grant write permissions at the PR stage (least privilege). The apply authentication (the mechanism of assuming a role without holding keys via OIDC) is detailed in a separate article, Keyless CI/CD with GitHub Actions OIDC. Don't put long-lived access keys in CI—this is a non-negotiable premise.

6.2 apply on merge (a role with a permission boundary)

Run apply only on a merge to main (= review-approved). The role used here gets a permission boundary, capping "the upper limit of resources Terraform may touch" at the IAM level.

# .github/workflows/terraform-apply.yml (only on merge to main)
name: terraform-apply
on:
  push:
    branches: ["main"]
    paths: ["envs/**", "modules/**"]

permissions:
  contents: read
  id-token: write

concurrency:
  group: tf-apply-prod   # serialize applies to the same environment (prevents state contention)
  cancel-in-progress: false

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production   # put a GitHub Environments approval gate in the path
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, boundaried apply role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # Apply-only role with a permission boundary. Resources outside the boundary cannot be touched.
          role-to-assume: arn:aws:iam::123456789012:role/tf-apply-boundaried
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=envs/prod init
      - run: terraform -chdir=envs/prod apply -auto-approve -lock-timeout=300s

There are three key design points.

Permission-boundary role: attach a permission boundary to the apply role to physically block operations on unexpected services/resources at the IAM level (least privilege). Whatever the Terraform code says, it can't touch outside the boundary.
Serialization: concurrency limits apply against the same environment to one run at a time; combined with -lock-timeout, this gives two layers of protection against state-lock conflicts.
Approval gate: with environment: production, interpose one human approval, balancing automation and governance.
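In Terraform itself, the permission-boundary role could be declared roughly as below. This is a sketch under assumptions: the account ID and role name follow the workflow shown earlier, while the repository name my-org/infra and the list of services allowed by the boundary are placeholders to replace with your own.

```hcl
# Trust policy: only GitHub Actions jobs of one repository, running in the
# "production" environment, may assume the role via OIDC
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/infra:environment:production"] # placeholder repo
    }
  }
}

# Boundary: the ceiling of what the role can ever be allowed to do
data "aws_iam_policy_document" "boundary" {
  statement {
    effect    = "Allow"
    actions   = ["ec2:*", "ecs:*", "s3:*", "logs:*"] # placeholder service list
    resources = ["*"]
  }
}

resource "aws_iam_policy" "tf_apply_boundary" {
  name   = "tf-apply-boundary"
  policy = data.aws_iam_policy_document.boundary.json
}

resource "aws_iam_role" "tf_apply" {
  name                 = "tf-apply-boundaried"
  assume_role_policy   = data.aws_iam_policy_document.github_oidc_trust.json
  permissions_boundary = aws_iam_policy.tf_apply_boundary.arn
}
```

A permission boundary grants nothing by itself. It only caps what the role's identity policies can allow, so the role still needs an ordinary least-privilege policy attached. The sub condition ties the role to jobs that run in the production environment of that one repository.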

6.3 Periodic drift detection (cron → file an Issue)

Passing CI doesn't make drift zero. With manual changes in the console, external factors, and events outside the permission boundary, reality silently diverges. So place a gate that periodically inspects "the difference between code and reality" and files an Issue when divergence is detected.

# .github/workflows/terraform-drift.yml (scheduled: detect divergence from reality)
name: terraform-drift
on:
  schedule:
    - cron: "0 0 * * *"   # daily at 00:00 UTC; don't let divergence linger
  workflow_dispatch: {}

permissions:
  contents: read
  id-token: write
  issues: write           # file an Issue when drift is detected

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, read-only role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/tf-plan-readonly
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=envs/prod init

      # -detailed-exitcode: 0 = no diff / 2 = diff present / 1 = error
      - name: Detect drift
        id: plan
        run: terraform -chdir=envs/prod plan -detailed-exitcode -lock-timeout=60s
        continue-on-error: true

      - name: Open issue on drift
        if: steps.plan.outputs.exitcode == 2
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: "⚠️ Terraform drift detected in prod",
              body: "The scheduled plan detected a diff (exit code 2). Code and the reality of production have diverged. Check for manual changes, then either reflect them in code or revert them.",
              labels: ["drift", "infra"],
            });

      # continue-on-error also swallows plan errors (exit code 1); fail the job explicitly
      - name: Fail on plan error
        if: steps.plan.outputs.exitcode == 1
        run: exit 1

The key is terraform plan -detailed-exitcode. It returns exit 0 if there's no diff, exit 2 if there's a diff, and 1 on error, so you can use this to "auto-file an Issue when it drifts." The essence is changing drift from "fix it when you notice" to "a machine finds it and files it every day." A read-only role suffices for detection, just like plan (least privilege).

7. Production-operations checklist

Let me fold the above into one sheet from an operational view. These are the items I always confirm when productionizing a new Terraform repository.

State: placed in a remote backend (GCS / S3) with locking enabled (use_lockfile for S3). terraform.tfstate is not committed.
Environment isolation: stg / prod split by separate state (directory separation). Production is not split by workspaces (officially discouraged).
Separation of concerns: app delivery (image updates) and infra changes are split. Mutable attributes are taken out of Terraform's management with ignore_changes.
Modules: no thin wrappers made. The tree is flat (one level). Dependencies are injected from outside (dependency inversion). Versions pinned with version / ?ref.
CI gates: plan + tfsec on PR (read-only role). apply on merge (permission-boundary role, serialized, approval gate). No local apply.
Auth: CI is keyless via OIDC (no long-lived access keys). The apply role is least-privilege + permission boundary.
Drift: detect divergence with periodic plan -detailed-exitcode and file an Issue. Not left unattended.

Maintainability (the same structure for everyone), reproducibility (lift stg straight to prod), and governance (trace who changed what with which permissions)—only when these three are together do you have "IaC that doesn't break."

8. Summary: cheat sheet

Finally, a quick reference for when you're unsure.

Extract a module: on the third duplication, or for a clear concept unit like "VPC" or "Cloud Run service." Don't make thin wrappers (officially discouraged) / don't get ahead of yourself (YAGNI).
Module structure: unify on main.tf / variables.tf / outputs.tf / versions.tf. The entrance is variable, the exit is output, and what you touch from outside is only output.
Composition: keep the tree flat (one level). Don't create dependencies internally—inject them from outside (dependency inversion). Deep nesting = a variable bucket relay = the enemy of ETC.
Pin versions: version for the Registry, ?ref=<tag/SHA> for Git. main references forbidden (reproducibility / drift prevention).
Environment isolation: split stg / prod by separate state (directory + backend separation). Workspaces are only for "short-lived work that doesn't need strong isolation" (officially discouraged for prod isolation).
Locking: a remote backend + simultaneous-run locking (use_lockfile for S3). CI doubly defends with concurrency + -lock-timeout.
Separation of concerns: split app delivery (Cloud Build, etc.) from infra (Terraform). Mutable attributes get ignore_changes. This is the biggest drift prevention.
CI gates: PR = plan + tfsec (read-only) / merge = apply (permission-boundary role, approval) / periodic = drift detection (-detailed-exitcode → Issue). Local apply abolished.
Auth: keyless via OIDC (details are in the separate article Keyless CI/CD with GitHub Actions OIDC). Cost optimization is out of scope for this article (see the FinOps article).

Terraform is not just a tool to "codify infrastructure." Cut responsibilities with modules, isolate and lock the reality of production with state, take apply away from human laptops with CI, and have a machine watch for drift—only when you do all this do you get IaC that doesn't break as scale grows.

In the in-house AI platform for a broadcaster, I codified all of GCP with about 71 modules, isolated stg / prod state, and prevented drift by splitting responsibilities between Cloud Build and Terraform. In the forestry-DX project (AWS, a Minister of Economy, Trade and Industry Award-winning product), I manage the infrastructure with 17 modules and S3 native locking, and automate plan (tfsec on PR) / apply (permission-boundary role) / drift detection (periodic cron → file an Issue). With one person × generative AI (Claude Code), I've designed and operated 100% Terraform infrastructure spanning GCP / AWS.

"How to codify your infrastructure, and how to put it on operations in a form that doesn't break"—from that design through CI setup and drift operations, I can accompany you end-to-end. Consultations like "our existing Terraform has bloated and we can't touch it" are welcome too. Feel free to reach out.

References (official documentation)

Modules Overview — Terraform — module definition, root/child, kinds of sources
Module Blocks (Use modules) — Terraform — the module block, source/version, count/for_each, output references
Module Sources — Terraform — source formats for local/Registry/Git, ?ref pinning
Develop Modules — Terraform — the standard module structure, "don't make thin wrappers," a flat tree
Module Composition — Terraform — composition, dependency inversion, the flat-tree guidance
Backend Configuration — Terraform — terraform { backend ... }, the one-backend constraint, no variable references, partial configuration
Remote State — Terraform — remote state, locking, terraform_remote_state
Workspaces — Terraform — workspace definition, terraform.workspace, the official warning that they're discouraged for production isolation

