# AWS ECS on Fargate Production Operation Guide: Designing, Deploying, Costing, and Securing Serverless Containers in Real Code

> An ECS on Fargate production operation guide faithful to the AWS official documentation. Systematizes, with Terraform, task-definition JSON, and real code: task-size design (the CPU/memory table), awsvpc networking, rolling updates + deployment circuit breaker, graceful shutdown via SIGTERM, separation of the execution role and the task role, and Fargate Spot and cost optimization.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: AWS, ECS, Fargate, Containers, Terraform, Infrastructure, Cost optimization, Observability
- URL: https://tomodahinata.com/en/blog/aws-ecs-fargate-production-guide
- Category: ECS on Fargate in production

## Key points

- Fargate is a serverless container execution platform that makes server management unnecessary, has an isolation boundary per task, and doesn't share the kernel, CPU, memory, or ENI with other tasks
- The keys to production quality are four: 'separating the execution role and the task role,' 'awsvpc + ALB (target type=ip),' 'rolling update + deployment circuit breaker for automatic rollback,' and 'receiving SIGTERM and finishing cleanly within stopTimeout'
- Task size is a fixed combination of CPU and memory (.25–16 vCPU). The standard approach is to measure, right-size, and scale horizontally with Application Auto Scaling target tracking
- Cost is usage-based billing of vCPU-seconds and memory-seconds. Optimize with ARM64 (Graviton), Fargate Spot (running interruption-tolerant workloads at a discount with a 2-minute SIGTERM warning), and Compute Savings Plans
- Inject secrets from Secrets Manager / SSM with valueFrom and consolidate the permission on the execution role. Meet production requirements including non-root execution and readonlyRootFilesystem, break-glass via ECS Exec, and Container Insights

---

"I want to run containers in production. But I don't want to take care of a Kubernetes cluster, and I can't spare time for EC2 patching or scaling either" — when a startup or a solo developer assembles a production container platform, you almost always arrive here. The answer is **AWS Fargate**.

On a lumber-distribution SaaS that won the Minister of Economy, Trade and Industry Award, I've operated **221 API endpoints** in production on top of a configuration of `API Gateway → NLB → ALB → ECS on Fargate`. The worker group of the payment platform ([0 double charges in production](/case-studies/payment-platform-reliability)) also runs on Fargate. Being able to run HTTP services, batches, and event-driven workers with the same mechanism without touching a single server is the foundation for producing production quality with a small team.

This article aims to be **faithful to the [AWS official documentation](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html) yet easier to follow than the official docs**, and to show in real code which feature to use in which situation. It covers end to end what you need to go to production: task design, networking, deployment, resilience, security, and cost. For the technology selection itself (ECS or EKS), see the separate article [ECS on Fargate vs. EKS: a startup decision framework](/blog/aws-ecs-vs-eks-startup-decision-framework). This article concentrates on **"after choosing ECS on Fargate, how to build it in production."**

---

## What is Fargate: the difference from the EC2 launch type

Amazon ECS (Elastic Container Service) has, broadly, two compute resources (**launch type / capacity**) for running containers.

- **EC2 launch type**: you prepare a fleet of EC2 instances (container instances) yourself and pack tasks onto them. OS patching, instance scaling, and bin-packing (packing efficiency) management are **your responsibility**.
- **Fargate**: a **serverless** method where, just by specifying CPU and memory, AWS prepares, patches, and scales the instances behind it. The concept of a server itself disappears.

The official definition is simple.

> AWS Fargate is a technology that you can use with Amazon ECS to run containers without having to manage servers or clusters of Amazon EC2 instances. (— [AWS Fargate](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/AWS_Fargate.html))

The most important single sentence for security is this.

> Each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task.

That is, **each Fargate task has an independent isolation boundary** and shares neither the kernel, CPU, memory, nor ENI (Elastic Network Interface) with other tasks. In the EC2 launch type, multiple tasks share one instance's kernel, but Fargate is "task = the minimal isolation unit." If you have multi-tenant or strict isolation requirements, this becomes a decisive advantage.

### Comparison table: when Fargate and when EC2

| Aspect | Fargate | EC2 launch type |
|------|---------|---------------|
| Server management | **Unnecessary** (patching / AMI updates also on the AWS side) | You operate the OS/AMI/patching yourself |
| Scaling | You only think about the task count | A **two-tier scale** of the instance fleet and tasks |
| Isolation boundary | **Fully isolated per task** | Tasks share the instance's kernel |
| Billing unit | **Usage of vCPU-seconds / memory-seconds** (the allocated amount) | Instance hours (fixed regardless of utilization) |
| Launch speed | Tens of seconds (including ENI allocation) | Fast if there's free space on an instance |
| GPU / special instances | Not supported | Supported (GPU, Inferentia, etc.) |
| Daemon type (one per host) | Not supported (no concept of a host) | DAEMON scheduling possible |
| Cost of constant high load | Can be expensive if utilization is high | Advantageous if high utilization. Also combine Savings Plans |

**Guideline**: when in doubt, Fargate. Because server-management cost (= labor cost) is the biggest cost. The only time you should go back to EC2 is **when there's a clear reason** like "I need a GPU," "there's a batch group constantly running at over 80% CPU and cost is dominant," or "I need an agent resident one-per-host (a DaemonSet equivalent)."

---

## When to use it: 3 typical workloads

Fargate is not "for web servers only." In practice, its strength is that the following three forms can all be handled with the same vocabulary (the task definition).

1. **A resident service (Service)**: an HTTP API or web app constantly running behind an ALB/NLB. Make it redundant with `desiredCount` and auto-scale. ← the most common.
2. **A scheduled task (batch)**: daily aggregation, report generation, and data sync run on cron with EventBridge Scheduler. It runs as a **one-off task (RunTask)** rather than a service, and billing stops when it finishes.
3. **An event-driven worker**: asynchronous processing that scales the task count by the SQS queue length. Payment webhook processing, image conversion, etc. Idempotency and graceful shutdown become the essence.

On the payment platform, I put both "a resident API service" and "SQS-driven idempotent workers" on Fargate, and [absorbed with idempotency keys](/blog/dynamodb-payment-reliability-idempotency-zero-downtime) the out-of-order and duplicate arrival of webhooks. **Being able to run all 3 forms on the same deploy platform** dramatically lowers the cognitive load of operation.

---

## The core components: the relationship of the 4 cast members

ECS has many terms, which is confusing at first. In essence there are just four.

```
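Cluster (logical box: a namespace that groups multiple services)
└── Service (declares "always keep N tasks running" = desired-state controller)
    └── Task (one running unit: a set of one or more containers)
        └── Container (your app's image)
        ↑
        Task Definition (the task blueprint: image, CPU/memory, IAM, logs, environment variables)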
```

- **Task Definition**: the immutable "blueprint." Versioned by revision number (`my-app:1`, `my-app:2`...). A deploy is **registering a new revision and swapping it into the service**.
- **Task**: the entity launched from a task definition. One task = one ENI (`awsvpc` mode) = one private IP.
- **Service**: a **declarative controller** of "always keep `desiredCount` tasks of this task definition, healthy." If a task dies it auto-restarts, and it takes care of registering/deregistering with the ALB.
- **Cluster**: a logical boundary that bundles services and tasks. In Fargate, there's no "server" in the cluster; it's close to just a namespace.

This declarative model of "the Service keeps maintaining the desired state" is the same idea as Kubernetes's `Deployment`. That's exactly why **declaring the state and entrusting it, rather than `docker run`-ing by hand**, is the correct usage.

---

## Task-size design: CPU and memory are "fixed combinations"

This is Fargate's biggest pitfall. **CPU and memory aren't a free combination; you can only choose from predetermined pairs.** The official combinations ([Task CPU and memory](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html)) are these.

| CPU (whole task) | Selectable memory | Step |
|------------------|------------|------|
| **256 (.25 vCPU)** | 512 MiB / 1 GB / 2 GB | Fixed 3 choices |
| **512 (.5 vCPU)** | 1–4 GB | 1 GB steps |
| **1024 (1 vCPU)** | 2–8 GB | 1 GB steps |
| **2048 (2 vCPU)** | 4–16 GB | 1 GB steps |
| **4096 (4 vCPU)** | 8–30 GB | 1 GB steps |
| **8192 (8 vCPU)** | 16–60 GB | 4 GB steps (PV 1.4.0+) |
| **16384 (16 vCPU)** | 32–120 GB | 8 GB steps (PV 1.4.0+) |

> CPU can be specified as `1024` (CPU units) or `1 vCPU`, and memory as `3072` (MiB) or `3 GB`. They're converted to internal units at registration.

**How it works in practice**: for example, even for a workload where "memory of 512 MiB suffices but I want 1 vCPU of CPU," the moment you choose `1024 CPU`, you secure (= are billed for) **a minimum of 2 GB of memory**. The reverse too — if "8 GB of memory is needed," at least 1 vCPU comes along. So the iron rule is to **decide the size "after measuring."** Take it large on speculation and you keep being billed per second for resources you don't use.
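The pairing rule can be sketched as a small lookup. This is a minimal illustration that transcribes the table above; the function name `smallestFargateSize` and its inputs (measured peak CPU units and memory in MiB) are hypothetical and not part of any AWS SDK.

```typescript
// Valid Fargate task sizes, transcribed from the table above (memory in MiB).
interface FargateSize {
  cpu: number;
  memory: number;
}

const GIB = 1024;

// Build an inclusive list of memory values (in MiB) from GiB bounds and a step.
function memoryRange(fromGiB: number, toGiB: number, stepGiB: number): number[] {
  const out: number[] = [];
  for (let g = fromGiB; g <= toGiB; g += stepGiB) out.push(g * GIB);
  return out;
}

const FARGATE_SIZES: ReadonlyArray<{ cpu: number; memory: number[] }> = [
  { cpu: 256, memory: [512, 1 * GIB, 2 * GIB] },
  { cpu: 512, memory: memoryRange(1, 4, 1) },
  { cpu: 1024, memory: memoryRange(2, 8, 1) },
  { cpu: 2048, memory: memoryRange(4, 16, 1) },
  { cpu: 4096, memory: memoryRange(8, 30, 1) },
  { cpu: 8192, memory: memoryRange(16, 60, 4) },
  { cpu: 16384, memory: memoryRange(32, 120, 8) },
];

// Return the first valid pair (lowest CPU tier first) that covers the measured
// peak, or null when nothing in the table is large enough.
function smallestFargateSize(needCpu: number, needMemoryMiB: number): FargateSize | null {
  for (const tier of FARGATE_SIZES) {
    if (tier.cpu < needCpu) continue;
    const memory = tier.memory.find((m) => m >= needMemoryMiB);
    if (memory !== undefined) return { cpu: tier.cpu, memory };
  }
  return null;
}
```

For example, `smallestFargateSize(1024, 512)` yields `{ cpu: 1024, memory: 2048 }`, which is the "1 vCPU drags along 2 GB" case described above, and `smallestFargateSize(256, 8192)` is pushed up to the 1024 CPU tier.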

### Ephemeral storage (a temporary disk)

A Fargate task has **20 GB** of ephemeral storage by default. You can use it for build artifacts, temporary files, and caches. If insufficient, you can expand it to **up to 200 GB** with the task definition's `ephemeralStorage` (platform version `1.4.0` and later). It's a volatile area that disappears when the task ends, so if you need persistence, use EFS or S3.

### The platform version is `LATEST` (= Linux 1.4.0)

> The **LATEST** Linux platform version is `1.4.0`. (— [Fargate platform versions](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/platform-fargate.html))

`1.4.0` is needed for ephemeral-storage expansion, `systemControls`, UDP NLB, and the like. **Use `LATEST` unless there's a special reason.** Because a new task always launches with the latest-revision infrastructure (patched), this is the default safe side from a security standpoint too. ARM64 (Graviton) workloads are also supported, and you can specify `ARM64` in `cpuArchitecture` (it pays off in the cost optimization described later).

---

## Networking: the correct way to connect awsvpc and ALB

Fargate is **fixed to the `awsvpc` network mode**. Because each task has a dedicated ENI and private IP, there's no concept of "mapping a host's port" like in EC2. Misunderstand this and you'll definitely get stuck with ALB integration.

**Three important points:**

1. **The ALB's target group is `target_type = "ip"`.** Not `instance`. Because tasks are tied to ENIs rather than EC2 instances, they're registered as IP targets (officially stated).
2. **The security group attaches to the task's ENI.** Minimize it to "only allow ALB's SG → the task's SG (the app's port)." Limit the task's SG inbound to the ALB's SG, and don't open 0.0.0.0/0.
3. **Placement in a private subnet + NAT Gateway** is the production standard. You can also place it directly in a public subnet with `assignPublicIp=ENABLED`, but the attack surface widens. Reach ECR/CloudWatch/Secrets Manager via a VPC endpoint or NAT.

The flow of a request becomes this.

```
Internet → ALB(public subnet) → Target Group(type=ip)
        → Task ENI(private subnet, SG=allow only ALB's SG) → container:8080
                                                   ↘ NAT GW → ECR / Secrets Manager / CloudWatch
```

If you need service discovery (service-to-service communication), use **ECS Service Connect** (or Cloud Map) for DNS-name-based name resolution, keeping internal communication loosely coupled without adding an ALB.

---

## Implementation ①: a minimal task definition (JSON)

First, get a grip on "what's mandatory" with a minimal task definition. The key points are listed after the block.

```json
{
  "family": "web-api",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "runtimePlatform": {
    "cpuArchitecture": "ARM64",
    "operatingSystemFamily": "LINUX"
  },
  "executionRoleArn": "arn:aws:iam::111122223333:role/web-api-exec",
  "taskRoleArn": "arn:aws:iam::111122223333:role/web-api-task",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/web-api:1a2b3c4",
      "essential": true,
      "user": "10001:10001",
      "readonlyRootFilesystem": true,
      "linuxParameters": { "initProcessEnabled": true },
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "environment": [{ "name": "NODE_ENV", "value": "production" }],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:ap-northeast-1:111122223333:secret:prod/db-Ab12Cd"
        }
      ],
      "stopTimeout": 60,
      "healthCheck": {
        "command": ["CMD-SHELL", "wget -q -O - http://localhost:8080/healthz || exit 1"],
        "interval": 15,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 30
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-api",
          "awslogs-region": "ap-northeast-1",
          "awslogs-stream-prefix": "app"
        }
      }
    }
  ]
}
```

**Points that pay off in production:**

- **Pin `image` to an immutable reference (a commit-SHA tag or a digest), not a mutable tag.** `latest` breaks reproducibility. By default ECS also resolves the tag to a digest (`versionConsistency: enabled`), so all tasks in the service run the same image.
- **Non-root execution with `user` + `readonlyRootFilesystem: true`.** Make only the paths that need writing writable, by mounting a task volume (`volumes` + `mountPoints`, backed by ephemeral storage) on them; `linuxParameters.tmpfs` is not supported on Fargate. This is the basics of minimizing the attack surface.
- **`secrets[].valueFrom`** injects secrets from Secrets Manager / SSM Parameter Store. Don't write them in plaintext in environment variables (described later).
- **`stopTimeout`** is 30 seconds by default, 120 seconds max. The grace for graceful shutdown (described later).
- **`healthCheck.startPeriod`** gives a grace right after startup, preventing forced termination from false detection during initialization.

---

## Implementation ②: a full production service set with Terraform

A task definition alone isn't production. It becomes "unbreakable, reproducible" only by declaring **the cluster, service, ALB, SG, logs, and auto-scale** with IaC. Let me assemble the set with Terraform (a configuration narrowed to the key points). I leave Terraform module design, state separation, and drift detection to [another article](/blog/terraform-module-design-state-isolation-drift-detection-guide), and concentrate here on the ECS-specific parts.

```hcl
# --- Cluster: enable Container Insights (the foundation of observability) ---
resource "aws_ecs_cluster" "main" {
  name = "prod"
  setting {
    name  = "containerInsights"
    value = "enhanced" # enhanced observability; recommended in production if the cost is acceptable
  }
}

# --- Task SG: inbound only from the ALB's SG ---
resource "aws_security_group" "task" {
  name_prefix = "web-api-task-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}
resource "aws_vpc_security_group_ingress_rule" "from_alb" {
  security_group_id            = aws_security_group.task.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}
resource "aws_vpc_security_group_egress_rule" "all_out" {
  security_group_id = aws_security_group.task.id
  ip_protocol       = "-1"
  cidr_ipv4         = "0.0.0.0/0" # to ECR/Secrets Manager/CloudWatch via NAT
}

# --- ALB target group: Fargate always uses target_type = "ip" ---
resource "aws_lb_target_group" "app" {
  name        = "web-api"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"
  deregistration_delay = 30 # connection draining; the default of 300s is often excessive
  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

# --- Service: rolling update + deployment circuit breaker ---
resource "aws_ecs_service" "app" {
  name            = "web-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2
  launch_type     = "FARGATE"
  platform_version = "LATEST"
  enable_execute_command = true # break-glass via ECS Exec

  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
  deployment_circuit_breaker {
    enable   = true
    rollback = true # on detected failure, roll back automatically to the previous revision
  }
  }

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.task.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }
  health_check_grace_period_seconds = 30
}
```

The combination of `desired_count = 2`, `minimum_healthy_percent = 100`, and `maximum_percent = 200` means a zero-downtime rolling update: "**while always keeping 2 tasks healthy, temporarily run up to 4 tasks to launch the new version, confirm it's healthy, then drop the old version.**" The next section explains it precisely.

---

## Deployment: rolling updates and the deployment circuit breaker

ECS's default deployment is a **rolling update (the `ECS` type)**. The behavior is decided by two parameters ([official](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/deployment-type-ecs.html)).

- **`minimumHealthyPercent`**: the lower bound on the number of tasks that must stay running and healthy during the deploy (a percentage of `desiredCount`, rounded up). E.g. with min 50% and desired 4, ECS can stop up to 2 old-version tasks before launching 2 new-version ones.
- **`maximumPercent`**: the upper bound on the number of tasks that may run during the deploy (a percentage of `desiredCount`, rounded down). E.g. with max 200% and desired 4, ECS can launch 4 new-version tasks before stopping the 4 old-version ones.

**`min 100% / max 200%`** is the safest side, fully launching the new version without dropping availability at all, then dropping the old version (in exchange, resources temporarily double). If cost-prioritizing, there's also the choice of `min 50% / max 100%` to swap without increasing resources.
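The arithmetic behind the two percentages can be sketched as follows. This is only an illustration of the rounding rules stated above (round up for the minimum, round down for the maximum); `rollingBounds` is a hypothetical helper, not an ECS API.

```typescript
interface RollingBounds {
  minHealthy: number; // tasks that must stay healthy during the deploy
  maxTotal: number; // tasks that may run at once during the deploy
}

// Convert the two deployment percentages into task counts for a desired count.
function rollingBounds(
  desiredCount: number,
  minimumHealthyPercent: number,
  maximumPercent: number,
): RollingBounds {
  return {
    minHealthy: Math.ceil((desiredCount * minimumHealthyPercent) / 100), // rounded up
    maxTotal: Math.floor((desiredCount * maximumPercent) / 100), // rounded down
  };
}
```

With `desiredCount = 2`, `rollingBounds(2, 100, 200)` gives `{ minHealthy: 2, maxTotal: 4 }`, the "keep 2 healthy, run up to 4" behavior of the Terraform service above. With an odd count such as `rollingBounds(3, 50, 150)`, the rounding gives `{ minHealthy: 2, maxTotal: 4 }`.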

### Automatically "rolling back" failures is the circuit breaker

This is the dividing line of production quality. The **deployment circuit breaker** prevents the accident of routing traffic to a new version that is stuck in a crash loop without anyone noticing.

> Both methods support rolling back to the previous service revision. (— official. Both the circuit breaker and CloudWatch alarms support rollback to the previous revision)

Put in `deployment_circuit_breaker { enable = true, rollback = true }`, and if the new task doesn't launch (doesn't pass the health check) a prescribed number of times, it **treats the deploy as failed and automatically reverts to the previous healthy revision.** Further, if you want to base it on the app's business metrics (error rate, etc.), you can also use **CloudWatch alarm linkage**. Enable both and it fails / rolls back the moment either condition is met.

### Version consistency via image digest

By default ECS **resolves the tag to an image digest** and guarantees all tasks in the service run with the same binary (`versionConsistency`). It structurally prevents the accident of "rebuilt and the contents of `latest` changed, so only some tasks are a different thing." Together with the operation of **making CommitSHA the tag and not depending on `latest`**, ensure reproducibility.

CI/CD assembled **keyless with OIDC** is the 2026 standard (don't store long-lived access keys). For specifics, see [keyless CI/CD with GitHub Actions OIDC](/blog/github-actions-oidc-keyless-cicd-aws-gcp-guide). The deploy itself is either registering a new revision and pointing the service at it with `aws ecs update-service --task-definition <family>:<revision>` (`--force-new-deployment` is for redeploying the same revision), or using the `amazon-ecs-deploy-task-definition` action.

---

## Resilience / idempotency: receive SIGTERM and "finish cleanly"

What's most overlooked and most accident-prone in Fargate is **graceful shutdown**. The task stops on every deploy, scale-in, and [Fargate Spot](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html) interruption. At that time, ECS takes the next steps.

1. **Deregister** the task from the ALB target (stop new requests, and wait for in-flight connections during `deregistration_delay`).
2. Send **`SIGTERM`** to the container.
3. Wait **`stopTimeout` (30 seconds by default, 120 seconds max)**.
4. If it still doesn't finish, force-terminate with **`SIGKILL`**.

> The SIGTERM signal must be received from within the container to perform any cleanup actions. Failure to process this signal results in the task receiving a SIGKILL signal after the configured `stopTimeout` and may result in **data loss or corruption**. (— official)

That is, **the app is responsible for catching SIGTERM, handling in-flight requests to completion, and cleanly closing DB connections and queue reception.** Neglect this and on every deploy, in-progress processing is killed with `SIGKILL`, causing data corruption and missed webhooks. In Node.js, you write it like this.

```ts
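// graceful-shutdown.ts: catch SIGTERM and drain in-flight requests before exiting
import http from "node:http";

export function installGracefulShutdown(
  server: http.Server,
  opts: { drainMs: number; onClose: () => Promise<void> },
): void {
  let shuttingDown = false;

  const shutdown = async (signal: NodeJS.Signals): Promise<void> => {
    if (shuttingDown) return; // ignore repeated signals idempotently
    shuttingDown = true;
    console.info({ msg: "shutdown:start", signal });

    // 1) Stop accepting new connections; resolves once in-flight responses finish
    const httpClosed = new Promise<void>((resolve) => {
      server.close(() => {
        console.info({ msg: "shutdown:http-closed" });
        resolve();
      });
    });
    // Idle keep-alive sockets would otherwise hold close() open (Node.js 18.2+)
    server.closeIdleConnections();

    // 2) Cap the drain below stopTimeout so we finish before SIGKILL
    const deadline = new Promise<void>((resolve) => setTimeout(resolve, opts.drainMs));

    // 3) Only after HTTP has drained, close external resources (DB pool, queue consumers)
    let exitCode = 0;
    try {
      await Promise.race([httpClosed.then(() => opts.onClose()), deadline]);
    } catch (err) {
      console.error({ msg: "shutdown:error", err });
      exitCode = 1;
    }
    console.info({ msg: "shutdown:done" });
    process.exit(exitCode);
  };

  process.on("SIGTERM", shutdown); // this is the signal ECS sends
  process.on("SIGINT", shutdown); // for local Ctrl-C
}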
```

The iron rule is to set `drainMs` **shorter** than `stopTimeout` (e.g. `drainMs: 50_000` against `stopTimeout: 60`), so that the process exits on its own before SIGKILL arrives. Setting `linuxParameters.initProcessEnabled: true` runs an init process as PID 1 that forwards signals to your app and reaps zombie processes, which avoids the PID 1 problem (signals not propagating correctly).

**The relationship with idempotency**: for an SQS-driven worker, a design of "even if killed midway, don't double-process a redelivered message" is mandatory. This isn't Fargate-specific but a matter of distributed processing in general, and the principles of [idempotent async processing](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide) apply directly. Graceful shutdown (don't miss anything) and idempotency (don't double-process) are **two wheels of a cart.**
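The "don't double-process a redelivered message" half can be sketched like this. It is a minimal illustration under stated assumptions: `QueueMessage`, `IdempotencyStore`, `InMemoryStore`, and `handleOnce` are hypothetical names, and the in-memory set only stands in for a durable store with an atomic conditional write (which is what a real worker would need).

```typescript
// Hypothetical message shape; a real SQS message carries more fields.
interface QueueMessage {
  messageId: string;
  body: string;
}

// Hypothetical store interface. In production this would be a durable, atomic
// conditional write, not process memory.
interface IdempotencyStore {
  claim(key: string): Promise<boolean>; // true only for the first claimant
  release(key: string): Promise<void>; // undo a claim so a retry can run
}

class InMemoryStore implements IdempotencyStore {
  private readonly seen = new Set<string>();

  async claim(key: string): Promise<boolean> {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  async release(key: string): Promise<void> {
    this.seen.delete(key);
  }
}

// Run `work` at most once per message ID; a redelivered duplicate is skipped.
async function handleOnce(
  store: IdempotencyStore,
  msg: QueueMessage,
  work: (msg: QueueMessage) => Promise<void>,
): Promise<"processed" | "skipped"> {
  if (!(await store.claim(msg.messageId))) return "skipped";
  try {
    await work(msg);
    return "processed";
  } catch (err) {
    // Release the claim so the redelivered message is retried, not dropped.
    await store.release(msg.messageId);
    throw err;
  }
}
```

One caveat of this sketch: a hard kill (SIGKILL) between `claim` and the end of `work` leaves the claim in place, so a real store would typically add an expiry or a status field to the claim. That is exactly why graceful shutdown and idempotency have to be designed together.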

---

## Auto-scaling: follow the measured values

Fargate's horizontal scale is assembled with **Application Auto Scaling's Target Tracking**. Declare a target value like "keep CPU utilization at 60%," and it increases `desiredCount` when exceeded and decreases it when below.

```hcl
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-tt"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.app.service_namespace
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60.0
    scale_in_cooldown  = 120
    scale_out_cooldown = 30 # scale out fast, scale in cautiously
  }
}
```

**Choosing a metric**: if CPU-bound, `ECSServiceAverageCPUUtilization`; if memory-bound, `ECSServiceAverageMemoryUtilization`. If you want to follow HTTP traffic directly, **`ALBRequestCountPerTarget`** (requests per target) tracks load most intuitively. The standard is a short `scale_out_cooldown` and a long `scale_in_cooldown`: **increase immediately, decrease cautiously**. This prevents "flapping," where you shrink right after a spike and then scramble to scale out again.

---

## Observability: a state where you can trace a stuck process at a glance

The smaller the team, the more observability is the lifeline. In Fargate, set up the following from the start.

- **Container Insights**: auto-collect CPU/memory/network/task count. Set it to `enhanced` and you get finer per-container metrics.
- **Logs**: to CloudWatch Logs with the `awslogs` driver. Make them **JSON structured logs** and always pass a correlation ID (request ID). For advanced requirements (multiple destinations, parsing, filtering), use **FireLens (Fluent Bit)** and route to CloudWatch/S3/OpenSearch, etc.
- **Traces**: collect distributed tracing as a sidecar with OpenTelemetry. SRE practice on ECS is detailed in [observability with OpenTelemetry × ECS](/blog/aws-observability-opentelemetry-sre-ecs).
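A structured log line with a correlation ID can be as small as the sketch below. The field names (`ts`, `level`, `msg`, `requestId`) and the helper `logLine` are assumptions chosen for illustration; the only requirement from the text above is one JSON object per line carrying the request ID.

```typescript
type LogLevel = "info" | "warn" | "error";

// Build one JSON log line. Core fields are written after the caller's extra
// fields so that extras cannot overwrite them.
function logLine(
  level: LogLevel,
  msg: string,
  requestId: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({
    ...fields,
    ts: new Date().toISOString(),
    level,
    msg,
    requestId,
  });
}
```

Writing the result with `console.log(logLine("info", "order created", requestId, { orderId }))` sends it to stdout, where the `awslogs` driver picks it up, and CloudWatch Logs Insights can then filter on `requestId`.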

### ECS Exec: "break-glass" into a production container

Without SSH, port opening, or key management, you can **enter a running container to investigate** with ECS Exec.

> in production scenarios, you can use it to gain break-glass access to your containers to debug issues. (— [ECS Exec](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html))

```bash
# After enabling enableExecuteCommand on the service/task:
aws ecs execute-command \
  --cluster prod \
  --task <task-id> \
  --container app \
  --interactive \
  --command "/bin/sh"
```

The mechanism is SSM Session Manager: **the operation is recorded in CloudTrail**, and you can send the commands and their output to CloudWatch Logs/S3 as audit logs. The task role needs the 4 `ssmmessages:*` actions (`CreateControlChannel`/`CreateDataChannel`/`OpenControlChannel`/`OpenDataChannel`). You can also apply fine-grained governance such as **denying Exec only into production containers** with IAM condition keys (`ecs:container-name` etc.). Being able to record "who, when, into which task" is auditability that SSH doesn't give you.

---

## Security: don't confuse the execution role and the task role

This is the point **most often gotten wrong** in Fargate production. There are two kinds of IAM roles, and their roles are completely different.

| Role | Who uses it | What for | Typical permissions |
|--------|---------|---------|------------|
| **Execution role** (`executionRoleArn`) | **The ECS/Fargate agent** | "To launch" the task | Pull the image from ECR, write to CloudWatch Logs, fetch and inject secrets from Secrets Manager/SSM |
| **Task role** (`taskRoleArn`) | **Your app code** | To call AWS APIs while running | S3 read/write, DynamoDB, SQS send/receive, etc. — **the least privilege the app needs** |

The official distinction is clear.

> The permissions granted in the IAM role are vended to containers running in the task. This role allows your application code to use other AWS services. (— the task role)
> These permissions aren't accessed by the Amazon ECS container and Fargate agents. For the IAM permissions that Amazon ECS needs to pull container images and run the task, see Amazon ECS task execution IAM role. (— the difference from the execution role)

**Principles**:
- **Consolidate secret fetching on the execution role** (because the resolution of `secrets[].valueFrom` is done by the agent at launch). If the app directly hits Secrets Manager while running, attach that to the task role.
- **The app's AWS access is the task role.** Create a **dedicated role** per service / task definition, and make it least privilege. "A do-anything role common to all tasks" is the biggest anti-pattern.
- Because each task has an independent isolation boundary in Fargate, the problem of "co-located tasks' credentials leaking," like an EC2 instance profile, structurally doesn't easily happen.

### Don't put secrets in plaintext in environment variables

Write a DB password in plaintext in `environment` and it leaks to everyone who can `DescribeTaskDefinition`. **Always inject via `secrets`** from Secrets Manager or SSM Parameter Store, and grant the execution role `secretsmanager:GetSecretValue` (and KMS decryption permission) at the minimum scope. This is an extension of "put secrets in env, don't put them in code," consistent with this portfolio's [root convention](/blog/typescript-type-safety-discipline-zod-nevererror-no-any).
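On the app side, a value injected through `secrets` arrives as an ordinary environment variable, so it helps to fail fast at startup when it is missing. A minimal sketch, assuming a hypothetical helper `requireEnv`; the environment is passed in as a parameter (you would pass `process.env`) to keep the function easy to test.

```typescript
// Read a required variable from an environment map, failing fast when absent.
// The error names the variable but never includes a secret value.
function requireEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

Calling `requireEnv(process.env, "DATABASE_URL")` once at boot turns a misconfigured `valueFrom` or a missing execution-role permission into an immediate, visible startup failure instead of an error on the first query.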

### Three more to harden

- **Non-root execution** (`user: "10001:10001"`) + **`readonlyRootFilesystem: true`**. Mount a writable volume only where writing is needed (Fargate does not support `linuxParameters.tmpfs`).
- **Image scanning**: stop known vulnerabilities before shipping with ECR scanning (enhanced scanning / Inspector integration).
- **Keep the task definition minimal**: `privileged` and host-system sharing are impossible in Fargate in the first place. Don't add unnecessary `linuxParameters`.
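These hardening rules are easy to check mechanically, for example in CI before registering a revision. The sketch below is one possible lint over a container definition object; `lintContainer` and the name pattern used to spot likely secrets are hypothetical heuristics, not an AWS feature.

```typescript
// The subset of container-definition fields this check looks at.
interface ContainerDef {
  name: string;
  user?: string;
  readonlyRootFilesystem?: boolean;
  environment?: { name: string; value: string }[];
}

// Heuristic only: variable names that usually indicate a secret.
const SECRET_LIKE = /(password|secret|token|api[_-]?key)/i;

// Return human-readable findings; an empty array means the checks passed.
function lintContainer(c: ContainerDef): string[] {
  const findings: string[] = [];
  const uid = (c.user ?? "").split(":")[0] ?? "";
  if (uid === "" || uid === "0" || uid === "root") {
    findings.push(`${c.name}: no explicit non-root user`);
  }
  if (c.readonlyRootFilesystem !== true) {
    findings.push(`${c.name}: readonlyRootFilesystem is not enabled`);
  }
  for (const e of c.environment ?? []) {
    if (SECRET_LIKE.test(e.name)) {
      findings.push(`${c.name}: ${e.name} looks like a secret in plaintext environment`);
    }
  }
  return findings;
}
```

Run against the `app` container from the minimal task definition earlier, this returns no findings; a container with no `user`, a writable root filesystem, and a `DB_PASSWORD` entry in `environment` returns three.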

For boundary defense including WAF and defense-in-depth, see [AWS WAF defense-in-depth](/blog/waf-defense-in-depth-aws-waf-cloud-armor-owasp-guide).

---

## Cost optimization: lean usage-based billing toward "only what you used"

Fargate is **per-second billing against the allocated vCPU and memory** (minimum 1 minute). Rather than "buying an instance and diluting it by utilization" like EC2, **you're billed for the allocated amount itself while the task is running**. So the direction of optimization is clear.

1. **Right-sizing**: look at actual usage in Container Insights and trim excessive allocation. As mentioned, CPU/memory are fixed combinations, so choose the minimal pair on the premise that "raise one and its partner also rises."
2. **Lean toward ARM64 (Graviton)**: just by making it `cpuArchitecture: ARM64`, you can run equivalent performance at a **unit price about 20% lower than x86** (a multi-arch build is needed). The more CPU-bound the resident service, the more it pays off.
3. **Fargate Spot**: run interruption-tolerant workloads (batches, stateless workers, dev environments) at a **big discount**. The cost is that "when AWS says to return the capacity, **it's interrupted with a 2-minute warning (SIGTERM)**." If you've implemented graceful shutdown, this is a sufficiently acceptable trade-off.
4. **Compute Savings Plans**: commit the baseline portion of constantly-running production services for 1 year / 3 years to lower the unit price. Combine with Spot and split layers as "the base is a discount commitment, the burst is on-demand/Spot."
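The billing model above reduces to a short formula, sketched here. The rates are deliberately passed in as parameters: the numbers in the usage note are placeholders, not real prices, so look up the current rates for your Region and architecture. `taskCost` and `FargateRates` are hypothetical names.

```typescript
// Hourly unit prices; supply the real values for your Region and architecture.
interface FargateRates {
  perVcpuHour: number;
  perGbHour: number;
}

// Cost of one task run: allocated vCPU and memory, billed per second with a
// one-minute minimum.
function taskCost(
  cpuUnits: number,
  memoryMiB: number,
  runtimeSeconds: number,
  rates: FargateRates,
): number {
  const billedSeconds = Math.max(runtimeSeconds, 60); // one-minute minimum
  const hours = billedSeconds / 3600;
  const vcpu = cpuUnits / 1024; // 1024 CPU units = 1 vCPU
  const gb = memoryMiB / 1024;
  return hours * (vcpu * rates.perVcpuHour + gb * rates.perGbHour);
}
```

With placeholder rates of `{ perVcpuHour: 0.04, perGbHour: 0.004 }`, one hour of a 1 vCPU / 2 GB task costs about 0.048, and dropping to 0.5 vCPU / 1 GB halves it. The point is that the allocation, not the utilization, drives the bill.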

### Capacity-provider strategy: safely mix Spot with base and weight

Mix on-demand and Spot with a **capacity-provider strategy** ([official](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/fargate-capacity-providers.html)).

- **`base`**: the **minimum number of tasks** to run on that provider (can be set on only one provider; default 0).
- **`weight`**: once `base` is satisfied, the **ratio in which additional tasks are distributed** across the providers.

```hcl
# Always keep at least 2 tasks on on-demand; split everything beyond that Spot:on-demand = 4:1.
default_capacity_provider_strategy {
  capacity_provider = "FARGATE"
  base              = 2
  weight            = 1
}
default_capacity_provider_strategy {
  capacity_provider = "FARGATE_SPOT"
  base              = 0
  weight            = 4
}
```

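With this, you can **procure the scale-out portion cheaply while protecting the availability baseline with on-demand**. For example, with a desired count of 12, the first 2 tasks go to on-demand as `base`, and the remaining 10 are split roughly 2:8, which gives 4 on-demand and 8 Spot tasks. On a Spot interruption, the service scheduler checks for available capacity and automatically tries to launch a replacement task (if Spot capacity is exhausted, it waits until capacity recovers). The interruption also shows up in EventBridge's ECS Task State Change event with the stop code `SpotInterruption`, so include it in your monitoring.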
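
One way to put Spot interruptions on monitoring is an EventBridge rule that matches the stop code and forwards the event to a notification target. The fragment below is a sketch under assumptions: the rule name and the `alert_topic_arn` variable (the ARN of an existing SNS topic) are hypothetical, and the `stopCode` field is the one shown in the AWS documentation linked above.

```hcl
# Hypothetical input: ARN of an SNS topic that already exists.
variable "alert_topic_arn" {
  type = string
}

# Matches ECS task state changes whose stop code is a Fargate Spot interruption.
resource "aws_cloudwatch_event_rule" "fargate_spot_interruption" {
  name        = "fargate-spot-interruption" # hypothetical name
  description = "ECS tasks stopped by a Fargate Spot interruption"

  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
    detail = {
      stopCode = ["SpotInterruption"]
    }
  })
}

# Forwards matching events to the SNS topic.
resource "aws_cloudwatch_event_target" "notify" {
  rule = aws_cloudwatch_event_rule.fargate_spot_interruption.name
  arn  = var.alert_topic_arn
}
```

The SNS topic's access policy must allow `events.amazonaws.com` to publish, or the events are dropped without a notification. Counting these events over time also tells you whether your `base` is large enough to absorb interruptions.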

The overall FinOps thinking (tags, budget alerts, continuous waste reduction) is summarized in [AWS startup cost optimization](/blog/aws-terraform-startup-cost-optimization-finops).

---

## Pre-production-release checklist

The items I always confirm before shipping.

- [ ] **Task size** is based on measured values, and you chose the smallest fixed CPU/memory pair that fits
- [ ] `platform_version` is `LATEST` (= 1.4.0), and you've considered whether `cpuArchitecture` can be ARM64
- [ ] **`awsvpc` + ALB `target_type=ip`.** The task SG accepts traffic only from the ALB's SG (no 0.0.0.0/0)
- [ ] Tasks run in private subnets with `assign_public_ip=false`; outbound traffic goes through NAT or VPC endpoints
- [ ] **The execution role and the task role are separate.** The task role is dedicated to the service and least-privilege
- [ ] **Secrets are injected with `secrets[].valueFrom`.** No plaintext in `environment`
- [ ] **Non-root execution + `readonlyRootFilesystem`.** The image passes ECR image scanning
- [ ] The image is pinned by a **commit-SHA tag** (no dependence on `latest`), and `versionConsistency` is in effect
- [ ] **Rolling update + `deployment_circuit_breaker {rollback=true}`** is enabled
- [ ] **SIGTERM handling** is implemented with `drainMs < stopTimeout`, and `initProcessEnabled: true` is set
- [ ] `healthCheck` has a `startPeriod`, and the service has `health_check_grace_period_seconds`
- [ ] **Application Auto Scaling** (target tracking) is configured with `min/max` and asymmetric cooldowns
- [ ] **Container Insights** is enabled, logs are structured and carry a correlation ID, and break-glass access works via `enable_execute_command`
- [ ] Cost: **Spot + capacity provider strategy** (base/weight), layered with Savings Plans

---

## Summary: Fargate is a tool for "erasing servers and concentrating on production quality"

The essence of ECS on Fargate is that you can **eliminate the biggest cost, server management (that is, labor), and concentrate on production quality itself.** This article pinned down four key points for doing so.

1. **Design**: CPU and memory are fixed pairs. Measure and right-size, and wire things up correctly with `awsvpc` + ALB (`target_type=ip`).
2. **Deployment**: rolling update + **automatic rollback with the circuit breaker**. Ensure version consistency with the image digest.
3. **Resilience**: **catch SIGTERM and finish cleanly within `stopTimeout`.** Pair it with idempotency to prevent both dropped and duplicate processing.
4. **Safety and cost**: **separate the execution role and the task role**, each with least privilege. Bring usage-based billing closer to "only what you used" with ARM64, Spot, and Savings Plans.

With this pattern, I've run an award-winning SaaS with 221 endpoints and a payment platform with zero double charges in production, with a small team. Even with **one person × generative AI**, as long as you stick to the patterns in the official documentation, you can reproduce world-class robustness. If you want to get your product's container platform into production quickly, cheaply, and safely, feel free to reach out.
