# Terraform Module Design and State Operations: Building 'IaC That Doesn't Break' with Separation of Concerns, stg/prod State Splitting, and Drift Detection

> An implementation guide to designing maintainable IaC with Terraform. From the criteria for extracting modules and the standard structure, composition-first, per-environment state isolation plus remote state + locking, drift prevention via separation of concerns, to CI gates of plan-with-tfsec / apply-with-a-permission-boundary-role / periodic drift detection—all explained with real configuration. Cost optimization (FinOps) is split into a separate article; this one focuses on structure and state operations.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: Terraform, IaC, Architecture Design, GCP, AWS
- URL: https://tomodahinata.com/en/blog/terraform-module-design-state-isolation-drift-detection-guide
- Category: Infrastructure, IaC & CI/CD
- Pillar guide: https://tomodahinata.com/en/blog/aws-ecs-vs-eks-startup-decision-framework

## Key points

- A good module has one reason to change, clear inputs/outputs, and is reusable; don't make a thin wrapper around a single resource
- Avoid deep nesting and keep it flat with composition (dependency inversion) that injects dependencies from the root rather than generating them internally
- Split stg / prod with separate state (directory separation), not workspaces; this follows the official guidance and is the safer choice
- Separate app delivery from infra changes, and structurally prevent drift on mutable attributes with ignore_changes
- Automate CI with 3 gates: PR = plan + tfsec (read-only) / merge = apply (permission-boundary role) / periodic drift detection

---

"I want to codify my infrastructure"—as a requirement it's one line. But the moment you try to put Terraform into production, the things to decide multiply at once. **From where do you extract a module? How do you structure a module? How deep do you allow nesting? Do you split state per environment, or are workspaces enough? Who runs `apply` with which permissions? And—how do you detect and crush the "drift" where code and reality silently diverge?**

This article is an implementation guide to designing and operating Terraform in a **maintainable** form. As source material, I'll weave in design decisions from the in-house AI platform I built for a major Japanese broadcaster (codifying all of GCP 100% with Terraform) and a forestry-DX project (AWS, a Minister of Economy, Trade and Industry Award-winning product).

> **The rules of this article**: module specs, state, and backend behavior are based on the **HashiCorp official documentation (as of June 2026)**. Block and argument names are official-compliant, but Terraform's behavior changes by version, so always confirm the latest specs in the [official documentation](https://developer.hashicorp.com/terraform/language) before going to production. The code is arranged in a form usable in real operation, but credentials and state bucket names are assumed to be in environment variables / backend configuration (never hardcode).

> **Cost optimization (FinOps) is a separate article.** Tag design, taking stock of wasteful resources, budget guardrails, and other "how to keep costs down with Terraform" topics are split into [Optimizing startup costs with Terraform (FinOps)](/blog/aws-terraform-startup-cost-optimization-finops). This article focuses on **module structure and state operations**—that is, "the skeleton of IaC that doesn't break." The two are complementary.

---

## 0. Mental model: a good module, and state as "the reality of production"

Before getting into design, let me fix two mental models running through this article. If these waver, the module will later bloat and the state will have accidents.

**① A good module = "one reason to change, clear inputs/outputs, reusable"**

The official documentation defines a module's purpose as "combining the resource types a provider offers to **describe new architectural concepts and raise the level of abstraction**." And it warns clearly: **"don't make modules that are thin wrappers around single other resource types."**

| | Good module | Bad module |
| --- | --- | --- |
| Responsibility | One reason to change (SRP) | "Network, DB, and IAM"—everything bundled |
| Inputs/outputs | Explicit and minimal via `variable` / `output` | Can't tell what to pass to make it work without reading it |
| Structure | Flat (one level of child modules) | Deep nesting; modules spawning modules |
| Reuse | Can carry to another env / another project | Fused to this project only |
| Abstraction | Concept units like "VPC" or "Cloud Run service" | A thin wrapper around one resource (worthless) |

**② State = a snapshot of "the reality of production"**

Terraform's state is a ledger mapping code to actual infrastructure. Think of it as **the reality of production itself.** That's exactly why—**split it per environment, lock it to prevent simultaneous-update accidents, and separate concerns so you don't create drift (divergence between code and reality).** IaC that treats state lightly will eventually fall into "you can't tell what production looks like even by reading the code."

With these two as axes, we descend into specifics in the order: extraction criteria → standard structure → composition → state isolation → CI gates → drift detection.

---

## 1. Module basics: the structure the official docs define

### 1.1 Root module and child modules

The official definition is clear.

- **Root module**: the `.tf` files directly under the working directory. The starting point where you run the `terraform` command.
- **Child module**: a module called from the root via a `module` block.

A module's source (`source`) can be loaded from a local path, the Terraform Registry, a Git repository, S3, a private registry, and more. The minimal invocation is this.

```hcl
module "consul" {
  source  = "hashicorp/consul/aws"
  version = "0.1.0"

  servers = 3
}
```

`source` is where the module is fetched from, `version` pins which version, and the rest of the arguments (here `servers`) are **that module's input variables.**

### 1.2 How to write the source (official-compliant)

`source` has a format determined by where it comes from. **Writing this ambiguously breaks reproducibility**, so follow the official examples precisely.

```hcl
# ① Local path (must start with ./ or ../)
module "app_cluster" {
  source = "./app-cluster"
}

# ② Terraform Registry (NAMESPACE/NAME/PROVIDER) + pinned version
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "6.0.1"
}

# ③ Generic Git (always pin a tag/SHA with ?ref)
module "vpc_git" {
  source = "git::https://example.com/vpc.git?ref=v1.2.0"
}

# ④ Git + SHA-1 hash pin (the strictest)
module "storage" {
  source = "git::https://example.com/storage.git?ref=51d462976d84fdea54b47d80dcabbf680badcdb8"
}
```

The iron rule here is **"always pin the version."** The `version` argument is **for Registry modules only.** You can't attach `version` to a Git source, so pin it with `?ref=v1.2.0` (a tag) or `?ref=<SHA>` instead. `ref` accepts any value that `git checkout` accepts (a branch, a SHA-1, or a tag), but **you must not make a mutable branch like `main` the ref of a production module.** Modules are downloaded at `terraform init`, so yesterday's run and today's run could fetch different code, and reproducibility vanishes.

### 1.3 Inputs, outputs, references

Ideally, a child module behaves like a pure function: it "receives inputs (`variable`), creates resources, and returns outputs (`output`)."

```hcl
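# Caller: pass inputs as arguments
module "ec2_instance" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "6.0.2"

  name          = "example-instance"
  ami           = data.aws_ami.latest_amazon_linux.id # data source assumed to be defined elsewhere in the root
  instance_type = "t2.micro"
}

# Reference an output from another resource: module.<LABEL>.<output>
# (here the "consul" module from section 1.1, assuming it exposes a
#  security_group_id output; other required arguments are omitted)
resource "aws_security_group_rule" "example" {
  # ...
  security_group_id = module.consul.security_group_id
}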
```

Referencing an output is the fixed syntax `module.<LABEL>.<output>`. **What you can touch from outside a module is only the values you exposed via `output`**—this is the true nature of "clear inputs/outputs" as a module quality. A design where you touch internal resources directly from outside is a sign that encapsulation is broken.
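To make "clear inputs/outputs" concrete, here is a minimal sketch of what the inside of a child module could look like. The module name `cloud-run-service`, the variable names, and the output name are hypothetical, and the resource `google_cloud_run_v2_service.this` is assumed to be defined in that module's `main.tf`.

```hcl
# modules/cloud-run-service/variables.tf (hypothetical module)
variable "name" {
  type        = string
  description = "Name of the Cloud Run service."
}

variable "min_instances" {
  type        = number
  description = "Minimum number of instances kept warm."
  default     = 0

  # Reject obviously invalid input at plan time instead of at the API call.
  validation {
    condition     = var.min_instances >= 0
    error_message = "min_instances must be zero or greater."
  }
}

# modules/cloud-run-service/outputs.tf
# Only values declared here are visible to the caller as module.<LABEL>.<output>.
output "service_uri" {
  description = "URL of the deployed service."
  value       = google_cloud_run_v2_service.this.uri # resource assumed to live in main.tf
}
```

A caller would then read `module.<LABEL>.service_uri` and nothing else; the resource itself stays an internal detail of the module.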

### 1.4 Module meta-arguments: `for_each` / `count`

When you want to instantiate the same module multiple times, just like resources, `count` / `for_each` are usable on module blocks too.

```hcl
locals {
  instance_configs = {
    web = { instance_type = "t2.micro" }
    job = { instance_type = "t3.small" }
  }
}

# for_each: create multiple module instances, each with its own settings per key
module "ec2_instance" {
  source   = "terraform-aws-modules/ec2-instance/aws"
  version  = "6.0.2"
  for_each = local.instance_configs

  name          = each.key
  instance_type = each.value.instance_type
}
```

```hcl
# count: create N instances of the same shape
locals {
  instance_names = ["web-1", "web-2"] # example values
}

module "ec2_instance" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "6.0.2"
  count   = length(local.instance_names)

  name          = local.instance_names[count.index]
  instance_type = "t2.micro"
}
```

**When to use which**: if the elements are "a set identifiable by key," use `for_each` (deleting one element in the middle doesn't cause the others to be recreated due to index shifting); if they're just "N of them," use `count`. In production, make `for_each` your first choice. `count` has the trap that deleting a middle element of a list replaces everything after it.
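The difference shows up in instance addresses: `count` addresses instances by position, `for_each` by key. As a hedged sketch, assuming an existing `count`-based module with two instances that you want to convert to the `for_each` form shown earlier, `moved` blocks (available since Terraform 1.1) let you re-key the instances without destroying them. The index-to-key mapping below is an assumption for illustration.

```hcl
# Before (count):    module.ec2_instance[0], module.ec2_instance[1]
# After  (for_each): module.ec2_instance["web"], module.ec2_instance["job"]

# Tell Terraform that the old positional addresses map to the new keyed ones,
# so the plan shows a move instead of destroy + create.
moved {
  from = module.ec2_instance[0]
  to   = module.ec2_instance["web"]
}

moved {
  from = module.ec2_instance[1]
  to   = module.ec2_instance["job"]
}
```

After the change has been applied in every environment, the `moved` blocks can be kept as a record or removed in a later PR.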

Besides these, on module blocks you can also use `providers` (assigning alternative provider configurations) and `depends_on` (explicit dependency).

---

## 2. When to extract a module: drawing the line with YAGNI and SRP

This is the point most prone to misjudgment in Terraform design. **Modularizing "just in case" is textbook technical debt.** Let me turn the official warnings ("don't make thin wrappers around single resources," "keep the module tree as flat as possible") into a practical decision table.

### 2.1 Extract / don't-extract decision table

| Situation | Decision | Rationale |
| --- | --- | --- |
| The same resource group is repeated in **3+ places** | **Extract** | DRY. Twice is coincidence, three times is a pattern |
| It represents **one concept** like "VPC" or "Cloud Run service" | **Extract** | Raising abstraction = the official module purpose |
| You want to **reproduce the same configuration** across environments (stg/prod) | **Extract** | The core of reuse and reproducibility |
| It just thinly wraps one `aws_s3_bucket` | **Don't extract** | Officially discouraged (thin wrapper) |
| It's only used in one place so far | **Don't extract (defer)** | YAGNI. Extract once the duplication becomes real |
| 30 `variable`s line up inside, and you can configure anything | **Don't extract as is; split it** | A sign of "too many responsibilities," not "configurable" |
| Extracting actually makes **the caller harder to read** | **Don't extract** | KISS. If the abstraction's comprehension cost exceeds its value, it's worthless |

**Applying YAGNI**: don't modularize ahead of time on the grounds that "we might use it in another env in the future." **Excessive module splitting breeds the worst kind of debt: deep nesting and over-abstraction.** Extracting once "the third duplication" or "a clear concept unit" appears is soon enough.

**Applying SRP**: test whether you can state a module's responsibility in one sentence. "Creates a VPC and subnets" is OK. "Creates a VPC, **and** creates a DB, **and** sets up IAM"—when an "and" appears, it has too many responsibilities. It's a candidate for splitting.
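For reference, this is roughly what the "thin wrapper" row in the table looks like in practice. It is a hypothetical anti-example (the module path and variable names are made up): every input is passed straight through to one resource, so the module adds indirection without raising the level of abstraction.

```hcl
# modules/s3-bucket/main.tf (hypothetical anti-example: do not extract this)
variable "bucket" {
  type = string
}

variable "tags" {
  type    = map(string)
  default = {}
}

# The module body is a 1:1 pass-through to a single resource.
resource "aws_s3_bucket" "this" {
  bucket = var.bucket
  tags   = var.tags
}

output "id" {
  value = aws_s3_bucket.this.id
}
```

Declaring `aws_s3_bucket` directly in the root is shorter and easier to read than calling this module, which is the practical meaning of "don't extract."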

### 2.2 Standard module structure

Following the standard structure the official docs recommend puts the same thing in the same place for everyone (maintainability).

```text
modules/
  cloud-run-service/
    main.tf          # the resources themselves
    variables.tf     # inputs (variable blocks)
    outputs.tf       # outputs (output blocks)
    versions.tf      # required_version / required_providers
    README.md        # usage and input/output description
    examples/        # invocation examples (optional, but recommended in production)
```

File splitting isn't enforced by the tool, but **the agreement of "resources in main, the entrance in variables, the exit in outputs"** dramatically lowers a team's cognitive cost. In my broadcaster project, I codified all of GCP (VPC through Cloud Run, Cloud SQL, Cloud Armor, Secret Manager, Identity Platform, Workflows) with **about 71 modules**, but precisely because all modules align to this same structure, even at 71 you can read them without getting lost. Unifying the structure pays off more the larger the scale.
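As an illustration of the `versions.tf` entry in the layout above, a minimal sketch might look like the following. The version numbers are placeholders, not the values used in the projects described here.

```hcl
# modules/cloud-run-service/versions.tf (sketch; version numbers are placeholders)
terraform {
  # Use the minimum Terraform version your team has actually tested.
  required_version = ">= 1.10.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # A reusable module usually states only a minimum; the root module
      # (envs/<env>) is the place to pin an exact or upper-bounded version.
      version = ">= 6.0.0"
    }
  }
}
```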

---

## 3. Composition first: avoid deep nesting

### 3.1 The official guidance: a flat tree + dependency inversion

The official documentation is clear about composition. **"Instead of a deeply nested module tree, use module composition," "keep the module tree as flat as possible (to one level of child modules)."**

Its core technique is **dependency inversion.** In the official words:

> "Rather than a module embedding its own dependencies and creating and managing its own copies, **the root module passes the dependencies in.**"

Let me place a bad design (nesting, fusion) and a good design (composition, inversion) side by side.

```hcl
# ❌ Bad: consul_cluster creates its own network internally
#    → to swap only the network, you end up rewriting the module
module "consul_cluster" {
  source = "./modules/aws-consul-cluster"
  # Internally creates aws_vpc / aws_subnet on its own (a hidden dependency)
}
```

```hcl
# ✅ Good: create the network in a separate module and pass its IDs in as inputs
module "network" {
  source = "./modules/aws-network"
  # ...
}

module "consul_cluster" {
  source = "./modules/aws-consul-cluster"

  vpc_id     = module.network.vpc_id      # inject the dependency from outside
  subnet_ids = module.network.subnet_ids
}
```

### 3.2 Why this works

The advantages the official docs cite are software-design principles themselves.

- **Flexibility (ETC: Easy To Change)**: even if you swap the dependency's origin from "resource → data source," the `consul_cluster` side is unchanged—because it depends on the interface (receiving `vpc_id`).
- **Reusability**: because modules are small and loosely coupled, you can reuse them in different combinations.
- **Clarity**: reading the root module gives you a panoramic view of "which parts connect how."
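The flexibility point can be sketched as follows: assuming the VPC already exists and is managed elsewhere, the root swaps the `network` module for data sources, and the `consul_cluster` call keeps the same two inputs. The tag value `shared-vpc` is a hypothetical name.

```hcl
# Look up an existing VPC instead of creating one (tag value is hypothetical).
data "aws_vpc" "existing" {
  tags = {
    Name = "shared-vpc"
  }
}

# Collect the subnet IDs that belong to that VPC.
data "aws_subnets" "existing" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.existing.id]
  }
}

# The module is unchanged: it still only receives vpc_id and subnet_ids.
module "consul_cluster" {
  source = "./modules/aws-consul-cluster"

  vpc_id     = data.aws_vpc.existing.id
  subnet_ids = data.aws_subnets.existing.ids
}
```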

**Deep nesting is the enemy of ETC.** Nesting `A → B → C → D` creates a "variable bucket relay" where, to pass a value to `D`, you must thread arguments through all of `A → B → C`, and nobody can tell what to change to make what work. **One flat level + composition at the root** is the only structure that doesn't break down as scale grows.

> The official docs also introduce multi-cloud abstraction (a thin abstraction expressing common concepts like "DNS record" or "Kubernetes cluster" with object-type variables), but they caution that you end up **accepting the trade-off of the "lowest common denominator."** Is it really worth unifying by discarding each vendor's powerful specific features?—reconsider it with YAGNI.
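A thin abstraction of this kind is typically expressed as an object-type variable that every vendor-specific module accepts. The sketch below is in the spirit of the official composition guide; the variable name and fields are illustrative.

```hcl
# Shared input shape for hypothetical modules such as dns-records-aws and
# dns-records-gcp. Only fields every vendor supports can appear here, which
# is exactly the "lowest common denominator" trade-off.
variable "recordsets" {
  type = list(object({
    name    = string
    type    = string
    ttl     = number
    records = list(string)
  }))
  description = "DNS record sets to create, independent of the DNS vendor."
}
```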

---

## 4. State operations ①: environment isolation (stg / prod)

From here is the heart of production operations. **State is the reality of production**—so it needs to be reliably split per environment.

### 4.1 Remote state + locking

According to the official docs, Terraform stores state locally in `terraform.tfstate` by default, but in team operations you place it in a **remote backend** (HCP Terraform / S3 / GCS / Azure Blob, etc.). There are two purposes to going remote.

1. **Sharing**: the whole team sees the same reality.
2. **Locking**: "prevent simultaneous Terraform runs against the same state." It's the lifeline that prevents state corruption and conflicts from simultaneous `apply`.

You can write exactly one backend inside the `terraform` block.

```hcl
# GCS backend (the GCP setup adopted in the broadcaster project)
terraform {
  backend "gcs" {
    bucket = "my-org-tfstate-prod"     # bucket that holds the state
    prefix = "platform/infra"          # path inside the bucket
  }
}
```

```hcl
# S3 backend + native locking (the AWS setup adopted in the forestry-DX project)
terraform {
  backend "s3" {
    bucket       = "my-org-tfstate-prod"
    key          = "platform/infra/terraform.tfstate"
    region       = "ap-northeast-1"
    encrypt      = true
    use_lockfile = true   # S3-native locking (no legacy DynamoDB lock table needed)
  }
}
```

> **A note from the forestry-DX project**: previously, the standard was to pair a DynamoDB table for S3 locking, but I manage state with S3 native locking (`use_lockfile`), preventing simultaneous runs without holding the extra resource of a lock-dedicated table. One fewer resource = one fewer thing to manage, which pays off in a quiet way (cost efficiency, lower operational load).

> **An important official constraint**: a backend block **cannot reference "named values" like variables, locals, or data sources.** A form like `bucket = var.state_bucket` is not allowed. When you want to change values per environment, use the **directory separation** described below, or **partial configuration.**

```bash
# Partial configuration: leave the backend block empty and inject values at init time
terraform init -backend-config="bucket=my-org-tfstate-stg" \
               -backend-config="prefix=platform/infra"
# Or from a file: terraform init -backend-config=stg.backend.hcl
```
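For the file-based variant, the two pieces could look like this. It is a minimal sketch that reuses the bucket and prefix from the command above; the file locations are assumptions.

```hcl
# backend.tf: the backend type is fixed, the values are left empty on purpose
terraform {
  backend "gcs" {}
}

# stg.backend.hcl: a separate file, passed via
#   terraform init -backend-config=stg.backend.hcl
# It contains only top-level arguments of the backend, with no enclosing block.
bucket = "my-org-tfstate-stg"
prefix = "platform/infra"
```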

### 4.2 Environment isolation: workspaces vs separate state (the most important decision)

On "how to split stg and prod," many teams **reach for workspaces and regret it.** The official guidance is clear, so let me quote it first.

> **"Workspaces are not appropriate for system decomposition or deployments requiring separate credentials and access controls."**

That is, the official docs themselves say **"don't split prod and staging with workspaces."** Let me put it in a decision table.

| Aspect | Workspaces (`terraform workspace`) | Separate state (directory / backend separation) |
| --- | --- | --- |
| Credential separation | Can't (same backend, same auth) | **Can** (a prod bucket, a prod role) |
| Isolation of mishaps | Risk of breaking prod by forgetting to `select` | Directories are physically separate = harder to have accidents |
| Swapping backend config | Not possible (one shared backend) | **Possible** (change bucket/prefix per environment) |
| Access control | Hard to split IAM per environment | **Can assign a least-privilege role per environment** |
| Suited for | **Short-lived parallel work** on the same configuration (trying out a feature branch, light verification) | **Environment isolation including production (stg/prod)** |
| Official stance | **Discouraged** for strong isolation | The **recommended form** for strong isolation |

**Conclusion**: **split stg / prod by separate state (directory separation).** This is official-compliant, and it was my choice in real projects.

```text
envs/
  stg/
    main.tf            # module "platform" { source = "../../modules/..." }
    backend.tf         # backend "gcs" { bucket = "...-tfstate-stg" }
    terraform.tfvars   # stg-specific values (instance size, etc.)
  prod/
    main.tf            # calls the same modules (reproducibility)
    backend.tf         # backend "gcs" { bucket = "...-tfstate-prod" }
    terraform.tfvars   # prod-specific values
modules/               # the "configuration definition" shared by stg/prod
  cloud-run-service/
  network/
  ...
```

The point is the form where **`modules/` is one (DRY: the single source of truth for the configuration) / `envs/<env>/` calls it and gives per-environment values (reproducibility).** You can lift a configuration verified in stg straight up to prod, while prod's state, auth, and permissions are physically isolated. Use workspaces only for situations where weak isolation is fine, such as "disposable verification of the same configuration."
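A sketch of what `envs/prod/main.tf` and its `terraform.tfvars` might contain under this layout. The module inputs (`region`, `min_instances`, `network_id`) and the output `module.network.network_id` are hypothetical; substitute whatever your modules actually expose.

```hcl
# envs/prod/main.tf (sketch)
variable "region" {
  type = string
}

variable "min_instances" {
  type = number
}

module "network" {
  source = "../../modules/network"

  region = var.region
}

module "api" {
  source = "../../modules/cloud-run-service"

  name          = "content-api"
  region        = var.region
  min_instances = var.min_instances
  network_id    = module.network.network_id # dependency injected from the root
}

# envs/prod/terraform.tfvars (separate file; only the values differ from stg)
region        = "asia-northeast1"
min_instances = 2
```

`envs/stg/main.tf` would be the same file with a different `terraform.tfvars` and `backend.tf`, which is what makes a configuration verified in stg reproducible in prod.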

```bash
# One of the few cases for workspaces (when strong isolation isn't needed)
terraform workspace new feature-x
terraform workspace select feature-x
terraform workspace list
# Inside the configuration, terraform.workspace gives the current workspace name
```
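Inside the configuration, `terraform.workspace` is commonly used to keep resource names from colliding between such short-lived workspaces. A minimal sketch, with a hypothetical bucket name:

```hcl
locals {
  # "default" keeps the plain name; any other workspace gets a suffix,
  # e.g. "-feature-x".
  name_suffix = terraform.workspace == "default" ? "" : "-${terraform.workspace}"
}

resource "aws_s3_bucket" "scratch" {
  bucket = "my-org-scratch${local.name_suffix}" # hypothetical bucket name
}
```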

### 4.3 Passing values between states: terraform_remote_state

When you want to share values across environments or layers (e.g., getting a VPC ID from the network layer's state), the official `terraform_remote_state` data source lets you reference it **read-only.**

```hcl
# Read the outputs of another state (the network layer)
data "terraform_remote_state" "network" {
  backend = "gcs"
  config = {
    bucket = "my-org-tfstate-prod"
    prefix = "platform/network"
  }
}

resource "google_cloud_run_v2_service" "api" {
  # name, location, containers, etc. omitted for brevity
  # Reference network information that the other state exposes via output
  # (note: the network layer must expose output "vpc_connector")
  template {
    vpc_access {
      connector = data.terraform_remote_state.network.outputs.vpc_connector
    }
  }
}
```

This makes a **loosely-coupled team split** hold: the core-platform team exposes "only the information OK to share with other teams" via `output`, and the consuming side receives it read-only. The caveat is that **you can only get the values the referenced side exposed via `output`.** Here too, the encapsulation principle of "what you touch from outside is only output" is at work.
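On the producing side, the value has to be declared as an output of the network layer's root module, since `terraform_remote_state` only exposes root-level outputs. A minimal sketch of that side, with hypothetical connector settings:

```hcl
# Root module of the network layer (state stored under prefix "platform/network").
# All argument values below are placeholders.
resource "google_vpc_access_connector" "main" {
  name          = "platform-connector"
  region        = "asia-northeast1"
  network       = "platform-vpc"
  ip_cidr_range = "10.8.0.0/28"
}

# Only what is declared as an output here can be read by other states.
output "vpc_connector" {
  description = "Full resource ID of the Serverless VPC Access connector."
  value       = google_vpc_access_connector.main.id
}
```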

---

## 5. State operations ②: prevent drift with separation of concerns

Drift is **the divergence between the state the code declared and the reality of the actual infrastructure.** This is the biggest cause of IaC breaking. And **the biggest source of drift is "mixed responsibilities."**

### 5.1 Don't mix app deployment with infra changes

The most common anti-pattern is **running app deployment (updating container images) and infra configuration changes in the same Terraform.** Do this, and:

- Every time the app team changes the image tag, Terraform runs,
- Unintended infra changes get mixed into that diff,
- and as a result you can no longer trace "who changed what, when."

My solution in the broadcaster project was to **clearly split responsibilities into two systems.**

| Responsibility | Tool in charge | What it manages |
| --- | --- | --- |
| **App delivery** | **Cloud Build** | Building the container image & reflecting the latest env |
| **Infra configuration** | **Terraform** | VPC / Cloud Run definitions / Cloud SQL / Cloud Armor / Secret Manager, etc. |

In other words, I **separated responsibilities: "container images and the latest env belong to Cloud Build" and "infra configuration belongs to Terraform."** Terraform declares only "the shape of the execution platform" and doesn't get involved with the version of the app running on top. This means the frequently-changing app deployments no longer shake Terraform's state, and **the main cause of drift structurally disappears** (a form of applying SRP to infra operations).

```hcl
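# Terraform declares "the shape of the execution platform." The running image tag is outside Terraform's management.
resource "google_cloud_run_v2_service" "api" {
  name     = "content-api"
  location = var.region

  template {
    containers {
      # ❌ Hardcoding a changing tag like :v1.2.3 here means Terraform
      #    sees a diff on every deploy, which becomes a source of drift
      image = var.container_image  # the delivery side supplies the value; ignore_changes below decouples the lifecycle
    }
  }

  lifecycle {
    # Exclude attributes updated by the app-delivery pipeline from Terraform's management
    ignore_changes = [template[0].containers[0].image]
  }
}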
```

Declaring "the attributes Terraform won't touch" with `lifecycle { ignore_changes }` is a practice that **guarantees separation of concerns at the code level.** Without this, every time Cloud Build updates the image, Terraform tries to "put it back" and a diff appears forever.
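The resource above references `var.region` and `var.container_image` without showing their declarations. One possible shape is sketched below: because of `ignore_changes`, the image value only matters when the service is first created, so a placeholder default is enough. The default here is Google's public sample image and is an assumption, not the value used in the project.

```hcl
variable "region" {
  type        = string
  description = "Region for the Cloud Run service."
}

variable "container_image" {
  type        = string
  description = "Image used when the service is first created. Later revisions are rolled out by the delivery pipeline."
  # Placeholder: replace with your own bootstrap image if you have one.
  default     = "us-docker.pkg.dev/cloudrun/container/hello"
}
```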

### 5.2 Pin module versions (reproducibility)

The pinning of `version` / `?ref` described in Section 1.2 is also a drift countermeasure. **If a module reference points at the `main` branch, an unfamiliar diff wells up in your `plan` the moment the upstream module is updated.** This too is a kind of drift. Always pin with `version` for the Registry and `?ref=<tag/SHA>` for Git, and do version updates as deliberate PRs.

---

## 6. CI gates: plan(tfsec) → apply(permission-boundary role) → drift detection

Once you've split state and responsibilities, the last step is to move to an operation where **"humans don't `apply` from their laptops."** Local apply is a hole in permissions, auditing, and reproducibility all at once. In my forestry-DX project (AWS), I automated Terraform with 3 gates: **plan (tfsec on PR) / apply (a role with a permission boundary) / drift detection (periodic cron → file an Issue).**

### 6.1 plan + tfsec on PR (the static-check gate)

The moment a PR is raised, visualize the `plan` diff and run a security static check with `tfsec`. **The iron rule is to change nothing here (read-only).**

```yaml
# .github/workflows/terraform-plan.yml (on PR: read-only)
name: terraform-plan
on:
  pull_request:
    paths: ["envs/**", "modules/**"]

permissions:
  contents: read
  id-token: write        # assume the role keylessly via OIDC (see the linked article below)
  pull-requests: write   # needed to post the plan result as a PR comment (comment step not shown)

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, read-only role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/tf-plan-readonly
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3

      - run: terraform -chdir=envs/prod init
      - run: terraform -chdir=envs/prod plan -no-color -lock-timeout=60s

      # Static security check (blocks public buckets, missing encryption, etc. in CI)
      - name: tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: .
```

The point is to keep **the plan role read-only.** There's no reason to grant write permissions at the PR stage (least privilege). The `apply` authentication (the mechanism of assuming a role without holding keys via OIDC) is detailed in a separate article, [Keyless CI/CD with GitHub Actions OIDC](/blog/github-actions-oidc-keyless-cicd-aws-gcp-guide). **Don't put long-lived access keys in CI**—this is a non-negotiable premise.
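One detail worth checking when you define that role: by default `plan` also acquires the state lock, and with `use_lockfile` the lock is an object written next to the state file. A "read-only" plan role therefore typically still needs write access to that single lock object (or you run plan with `-lock=false`). The sketch below covers only the state-bucket part of such a policy, assuming the bucket and key from the S3 backend example earlier; the policy name is hypothetical, and the exact permissions should be confirmed against the S3 backend documentation.

```hcl
data "aws_iam_policy_document" "tf_plan_state_access" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::my-org-tfstate-prod"]
  }

  # The state itself is read-only for the plan role.
  statement {
    sid       = "ReadState"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::my-org-tfstate-prod/platform/infra/terraform.tfstate"]
  }

  # The lock object (<key>.tflock) must be creatable and deletable.
  statement {
    sid       = "ManageLockObject"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::my-org-tfstate-prod/platform/infra/terraform.tfstate.tflock"]
  }
}

resource "aws_iam_policy" "tf_plan_state_access" {
  name   = "tf-plan-state-access" # hypothetical name
  policy = data.aws_iam_policy_document.tf_plan_state_access.json
}
```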

### 6.2 apply on merge (a role with a permission boundary)

Run `apply` only on a merge to `main` (= review-approved). The role used here gets a **permission boundary**, capping "the upper limit of resources Terraform may touch" at the IAM level.

```yaml
# .github/workflows/terraform-apply.yml (only on merge to main)
name: terraform-apply
on:
  push:
    branches: ["main"]
    paths: ["envs/**", "modules/**"]

permissions:
  contents: read
  id-token: write

concurrency:
  group: tf-apply-prod   # serialize applies to the same environment (prevents state conflicts)
  cancel-in-progress: false

jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production   # go through the GitHub Environments approval gate
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, boundaried apply role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # Apply-only role with a permission boundary. It cannot touch resources outside the boundary.
          role-to-assume: arn:aws:iam::123456789012:role/tf-apply-boundaried
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=envs/prod init
      - run: terraform -chdir=envs/prod apply -auto-approve -lock-timeout=300s
```

There are three key design points.

1. **Permission-boundary role**: attach a permission boundary to the `apply` role to physically block operations on unexpected services/resources at the IAM level (least privilege). Whatever the Terraform code says, it can't touch outside the boundary.
2. **Serialization**: use `concurrency` to limit `apply` against the same environment to one run at a time; combined with `-lock-timeout`, this gives double protection against **state-lock conflicts.**
3. **Approval gate**: with `environment: production`, interpose one human approval, balancing automation and governance.
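The first point can be sketched in Terraform itself. This is a minimal example under stated assumptions, not the project's actual definition: the repository name `my-org/infra` and the boundary policy ARN are hypothetical, and the account's GitHub OIDC provider is assumed to exist already. Because the job declares `environment: production`, the OIDC subject claim takes the `environment:` form used in the condition below.

```hcl
# Trust policy: only GitHub Actions jobs from the given repo and environment
# may assume the role, via OIDC (no long-lived keys).
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:my-org/infra:environment:production"] # hypothetical repo
    }
  }
}

resource "aws_iam_role" "tf_apply" {
  name               = "tf-apply-boundaried"
  assume_role_policy = data.aws_iam_policy_document.github_oidc_trust.json

  # Whatever policies are attached later, effective permissions can never
  # exceed this boundary policy (hypothetical ARN).
  permissions_boundary = "arn:aws:iam::123456789012:policy/tf-apply-boundary"
}
```

The boundary policy itself lists the services Terraform is allowed to manage; the role's attached policies then grant permissions within that ceiling.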

### 6.3 Periodic drift detection (cron → file an Issue)

Passing CI doesn't make drift zero. **Through manual changes in the console, external factors, and events outside the permission boundary**, reality silently diverges. So place a gate that **periodically inspects "the difference between code and reality" and files an Issue when divergence is detected.**

```yaml
# .github/workflows/terraform-drift.yml (scheduled run: detects divergence from reality)
name: terraform-drift
on:
  schedule:
    - cron: "0 0 * * *"   # daily at 00:00 UTC; don't leave divergence unattended
  workflow_dispatch: {}

permissions:
  contents: read
  id-token: write
  issues: write           # file an Issue when drift is detected

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC, read-only role)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/tf-plan-readonly
          aws-region: ap-northeast-1

      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=envs/prod init

      # -detailed-exitcode: no diff = 0 / diff present = 2 / error = 1
      - name: Detect drift
        id: plan
        run: terraform -chdir=envs/prod plan -detailed-exitcode -lock-timeout=60s
        continue-on-error: true

      - name: Open issue on drift
        if: steps.plan.outputs.exitcode == 2
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: "⚠️ Terraform drift detected in prod",
              body: "The scheduled plan detected a diff (exit code 2). Code and production reality have diverged. Check whether manual changes were made, then either reflect them in code or revert them.",
              labels: ["drift", "infra"],
            });
```

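The key is **`terraform plan -detailed-exitcode`.** It returns exit 0 if there's no diff, **exit 2 if there's a diff**, and 1 on error, so you can use this to "auto-file an Issue when it drifts." The `steps.plan.outputs.exitcode` value read in the workflow comes from the wrapper that `hashicorp/setup-terraform` installs by default. Note that `continue-on-error: true` also lets exit code 1 pass, so a plan that fails outright leaves the job green; consider failing the job explicitly in that case. The essence is changing drift from "fix it when you notice" to "a machine finds it and files it every day." A read-only role suffices for detection, just like plan (least privilege).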

---

## 7. Production-operations checklist

Let me fold the above into one sheet from an operational view. These are the items I always confirm when productionizing a new Terraform repository.

- **State**: placed in a remote backend (GCS / S3) with locking enabled (`use_lockfile` for S3). `terraform.tfstate` is not committed.
- **Environment isolation**: stg / prod split by **separate state (directory separation).** Production is not split by workspaces (officially discouraged).
- **Separation of concerns**: app delivery (image updates) and infra changes are split. Mutable attributes are taken out of Terraform's management with `ignore_changes`.
- **Modules**: no thin wrappers made. The tree is flat (one level). Dependencies are injected from outside (dependency inversion). Versions pinned with `version` / `?ref`.
- **CI gates**: `plan` + `tfsec` on PR (read-only role). `apply` on merge (permission-boundary role, serialized, approval gate). No local apply.
- **Auth**: CI is keyless via OIDC (no long-lived access keys). The `apply` role is least-privilege + permission boundary.
- **Drift**: detect divergence with periodic `plan -detailed-exitcode` and file an Issue. Not left unattended.

Maintainability (the same structure for everyone), reproducibility (lift stg straight to prod), and governance (trace who changed what with which permissions)—only when these three are together do you have "IaC that doesn't break."

---

## 8. Summary: cheat sheet

Finally, a quick reference for when you're unsure.

- **Extract a module**: on the third duplication, or for a clear concept unit like "VPC" or "Cloud Run service." **Don't make thin wrappers (officially discouraged) / don't get ahead of yourself (YAGNI).**
- **Module structure**: unify on `main.tf` / `variables.tf` / `outputs.tf` / `versions.tf`. The entrance is `variable`, the exit is `output`, and what you touch from outside is only `output`.
- **Composition**: keep the tree flat (one level). Don't create dependencies internally—**inject them from outside** (dependency inversion). Deep nesting = a variable bucket relay = the enemy of ETC.
- **Pin versions**: `version` for the Registry, `?ref=<tag/SHA>` for Git. `main` references forbidden (reproducibility / drift prevention).
- **Environment isolation**: split stg / prod by **separate state (directory + backend separation).** Workspaces are only for "short-lived work that doesn't need strong isolation" (officially discouraged for prod isolation).
- **Locking**: a remote backend + simultaneous-run locking (`use_lockfile` for S3). CI doubly defends with `concurrency` + `-lock-timeout`.
- **Separation of concerns**: split app delivery (Cloud Build, etc.) from infra (Terraform). Mutable attributes get `ignore_changes`. This is the biggest drift prevention.
- **CI gates**: PR = plan + tfsec (read-only) / merge = apply (permission-boundary role, approval) / periodic = drift detection (`-detailed-exitcode` → Issue). Local apply abolished.
- **Auth**: keyless via OIDC (→ [details](/blog/github-actions-oidc-keyless-cicd-aws-gcp-guide)). **Cost optimization is out of scope for this article** (→ [the FinOps edition](/blog/aws-terraform-startup-cost-optimization-finops)).

Terraform is not just a tool to "codify infrastructure." **Cut responsibilities with modules, isolate and lock the reality of production with state, take `apply` away from human laptops with CI, and have a machine watch for drift**—only when you do all this do you get IaC that doesn't break as scale grows.

In the in-house AI platform for a broadcaster, I codified all of GCP with **about 71 modules**, **isolated stg / prod state**, and **prevented drift by splitting responsibilities between Cloud Build and Terraform.** In the forestry-DX project (AWS, a Minister of Economy, Trade and Industry Award-winning product), I manage the infrastructure with **17 modules** and S3 native locking, and automate **plan (tfsec on PR) / apply (permission-boundary role) / drift detection (periodic cron → file an Issue).** With one person × generative AI (Claude Code), I've designed and operated 100% Terraform infrastructure spanning GCP / AWS.

**"How to codify your infrastructure, and how to put it on operations in a form that doesn't break"—from that design through CI setup and drift operations, I can accompany you end-to-end.** Consultations like "our existing Terraform has bloated and we can't touch it" are welcome too. Feel free to reach out.

---

### References (official documentation)

- [Modules Overview — Terraform](https://developer.hashicorp.com/terraform/language/modules) — module definition, root/child, kinds of sources
- [Module Blocks (Use modules) — Terraform](https://developer.hashicorp.com/terraform/language/modules/syntax) — the `module` block, `source`/`version`, `count`/`for_each`, output references
- [Module Sources — Terraform](https://developer.hashicorp.com/terraform/language/modules/sources) — `source` formats for local/Registry/Git, `?ref` pinning
- [Develop Modules — Terraform](https://developer.hashicorp.com/terraform/language/modules/develop) — the standard module structure, "don't make thin wrappers," a flat tree
- [Module Composition — Terraform](https://developer.hashicorp.com/terraform/language/modules/develop/composition) — composition, dependency inversion, the flat-tree guidance
- [Backend Configuration — Terraform](https://developer.hashicorp.com/terraform/language/backend) — `terraform { backend ... }`, the one-backend constraint, no variable references, partial configuration
- [Remote State — Terraform](https://developer.hashicorp.com/terraform/language/state/remote) — remote state, locking, `terraform_remote_state`
- [Workspaces — Terraform](https://developer.hashicorp.com/terraform/language/state/workspaces) — workspace definition, `terraform.workspace`, the official warning that they're discouraged for production isolation