# The Complete ECS on Fargate Troubleshooting Guide: Diagnosing and Fixing Tasks That Won't Start or Die Immediately, by Stop Code

> A practical guide to diagnosing and fixing ECS Fargate task stops (CannotPullContainerError, OutOfMemoryError, health-check failures, and more) systematically, stop code by stop code, starting with how to read `describe-tasks` output.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: AWS, ECS, Fargate, Troubleshooting, Observability, Debugging, Containers, Operations
- URL: https://tomodahinata.com/en/blog/aws-ecs-fargate-troubleshooting-task-stopped-reasons-guide
- Category: ECS on Fargate in production
- Pillar guide: https://tomodahinata.com/en/blog/aws-ecs-fargate-production-guide

## Key points

- The reason a task fails to start or dies always shows up in `stoppedReason`, `stopCode`, and `exitCode`. The first move is to read all of them with `aws ecs describe-tasks`
- Roughly 80% of CannotPullContainerError cases are either missing network reachability in a private subnet (no VPC endpoint or NAT) or a missing ECR permission on the execution role
- The execution role (`executionRoleArn`) is used by the ECS agent; the task role (`taskRoleArn`) is used by the app. Confuse the two and secret retrieval, image pulls, and logging break one after another
- Most ELB health-check failures come from an insufficient `startPeriod` / `health_check_grace_period_seconds` or from forgetting `target_type=ip`. A missing security-group (SG) allow rule is also easy to overlook
- ECS Exec (`execute-command`) is a break-glass tool that runs over SSM Session Manager and is recorded in CloudTrail. `enableExecuteCommand` and the 4 `ssmmessages` actions are mandatory

---

"After deploying, the task won't start," "it starts but dies immediately," "it never passes the ALB health check"—problems you'll surely hit at least once operating ECS on Fargate.

On a lumber-distribution B2B SaaS, I ran 221 endpoints on an `API Gateway → NLB → ALB → ECS on Fargate` architecture, and ran the payment platform's worker fleet ([zero double charges in production](/case-studies/payment-platform-reliability)) on the same foundation. Task-stop problems are eliminated not by "operational carefulness" but by **a diagnostic pattern and structural prevention.** Rather than staring at CloudWatch with a feeling that "something's off," the shortest route is to **narrow down the cause logically from `stoppedReason` and `stopCode`.**

This article **organizes ECS Fargate's task stop reasons by category** and walks through each one end to end: where to look → what the cause is → how to fix it → how to prevent recurrence. For the whole picture of production design, read the [ECS on Fargate production-operations guide](/blog/aws-ecs-fargate-production-guide) first.

---

## Where to look first: the starting point of diagnosis

Once you notice that a task has died, there are 4 places to check.

| Place to check | What it tells you |
|---------|------------|
| The console's "Stopped tasks" tab | A summary of stoppedReason, exitCode, the final status |
| `aws ecs describe-tasks` | Structured data of all fields (the most detailed) |
| CloudWatch Logs (awslogs) | Logs the app wrote itself (panics, startup failures, etc.) |
| The EventBridge "ECS Task State Change" event | Async stop notification. The starting point for alert integration / automated response |

**The first move is always this command.**

```bash
aws ecs describe-tasks \
  --cluster prod \
  --tasks <task-id> \
  --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,stopCode:stopCode,containers:containers[*].{name:name,reason:reason,exitCode:exitCode,lastStatus:lastStatus}}' \
  --output json
```

---

## How to read describe-tasks: always confirm 5 fields

```json
{
  "lastStatus": "STOPPED",
  "stoppedReason": "Essential container in task exited",
  "stopCode": "EssentialContainerExited",
  "containers": [
    {
      "name": "app",
      "lastStatus": "STOPPED",
      "exitCode": 1,
      "reason": ""
    }
  ]
}
```

| Field | Meaning | Point to note |
|-----------|------|------------|
| `lastStatus` | The task's current overall state | Confirm that it is `STOPPED` |
| `stoppedReason` | Human-readable stop-reason text | Shows the category, e.g. "CannotPullContainerError" or "Your Spot Task was interrupted." |
| `stopCode` | A machine-readable stop code | Used for EventBridge filters and alerts |
| `containers[].exitCode` | The container process's exit code | 0=normal, 1=app error, 137=SIGKILL (typically an OOM kill), null=never reached startup |
| `containers[].reason` | The container's individual reason | Details for pull errors etc. appear here |

**When `exitCode` is null**, the failure happened on the Fargate-agent side before the app even started (CannotPullContainerError, ResourceInitializationError, etc.). Fix the agent-side cause before digging through the app's logs.

---

## Diagnostic flowchart

```text
A task became STOPPED
        |
        +-- "CannotPullContainer" in stoppedReason ?
        |       YES → go to § CannotPullContainerError
        |
        +-- "ResourceInitializationError" in stoppedReason ?
        |       YES → go to § ResourceInitializationError
        |
        +-- stopCode = "EssentialContainerExited" ?
        |       YES → check containers[].exitCode
        |               exitCode=137 → go to § OutOfMemoryError
        |               exitCode≠0, non-null → go to § Essential container exited
        |               exitCode=null → go to § CannotStartContainerError
        |
        +-- "health check" / "ELB" in stoppedReason ?
        |       YES → go to § ELB health-check failure
        |
        +-- stopCode = "SpotInterruption" ?
        |       YES → go to § Spot interruption (not a failure)
        |
        +-- "timeout" in stoppedReason ?
                YES → go to § ContainerRuntimeTimeoutError
```
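
The flowchart can be sketched as a pure function over the `describe-tasks` fields, which is handy if you want an alert handler to route straight to the right section. This is a minimal sketch: `classifyStoppedTask`, the interfaces, and the `Section` labels are hypothetical names that mirror this article's headings, not values returned by AWS. It follows the branches exactly as drawn; tasks that fail before startup may report a different `stopCode` (such as `TaskFailedToStart`), so extend the null-exit-code branch if you see that in your own output.

```ts
// Minimal shape of the fields read from `aws ecs describe-tasks`.
interface ContainerState {
  name: string;
  exitCode?: number | null;
  reason?: string;
}

interface StoppedTask {
  stoppedReason?: string;
  stopCode?: string;
  containers: ContainerState[];
}

// Labels are this article's section names (hypothetical, not AWS values).
type Section =
  | "CannotPullContainerError"
  | "ResourceInitializationError"
  | "OutOfMemoryError"
  | "EssentialContainerExited"
  | "CannotStartContainerError"
  | "ELBHealthCheck"
  | "SpotInterruption"
  | "ContainerRuntimeTimeoutError"
  | "Unknown";

function classifyStoppedTask(task: StoppedTask): Section {
  const reason = task.stoppedReason ?? "";
  // Container-level reasons often carry the pull/start error detail.
  const combined = [reason]
    .concat(task.containers.map((c) => c.reason ?? ""))
    .join(" ");

  if (combined.indexOf("CannotPullContainer") !== -1) return "CannotPullContainerError";
  if (combined.indexOf("ResourceInitializationError") !== -1) return "ResourceInitializationError";

  if (task.stopCode === "EssentialContainerExited") {
    const codes: Array<number | null> = task.containers.map((c) => c.exitCode ?? null);
    if (codes.indexOf(137) !== -1) return "OutOfMemoryError";
    if (codes.some((c) => c !== null && c !== 0)) return "EssentialContainerExited";
    if (codes.length > 0 && codes.every((c) => c === null)) return "CannotStartContainerError";
  }

  if (/health check/i.test(reason) || reason.indexOf("ELB") !== -1) return "ELBHealthCheck";
  if (task.stopCode === "SpotInterruption") return "SpotInterruption";
  if (/timeout/i.test(combined)) return "ContainerRuntimeTimeoutError";
  return "Unknown";
}
```

Feeding it the JSON from the previous section (exit code 1 under `EssentialContainerExited`) lands in the "Essential container exited" branch. Anything it can't place returns `"Unknown"` rather than guessing.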

---

## CannotPullContainerError: the image can't be pulled

### Symptom

The task stops at `PROVISIONING` → `PENDING`, and `stoppedReason` shows the following.

```text
CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...
```

exitCode is null (hasn't reached app startup).

### Per-cause checklist

**① Lack of VPC endpoints / NAT (most common)**

The task runs in a private subnet with `assignPublicIp=DISABLED`, but the subnet has no outbound path to ECR. It needs either a NAT Gateway or the full set of endpoints below.

- A NAT Gateway (the outbound route)
- VPC endpoints for ECR (`com.amazonaws.<region>.ecr.api` + `com.amazonaws.<region>.ecr.dkr`)
- An S3 gateway endpoint (required because ECR stores image layers in S3)

```bash
# List the VPC endpoints in the VPC
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=<vpc-id>" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'
```

The minimum set needed (private-subnet configuration):

| Endpoint | Type | Use |
|--------------|------|------|
| `ecr.api` | Interface | ECR API calls (authentication, image metadata) |
| `ecr.dkr` | Interface | Pulling layers (Docker Registry API) |
| `s3` | Gateway | The substance of ECR layers (stored in S3) |
| `logs` | Interface | The awslogs driver (CloudWatch Logs) |
| `secretsmanager` | Interface | secrets valueFrom (if used) |
| `ssmmessages` | Interface | ECS Exec (if used) |
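
To turn the table into a repeatable check, you can diff the output of the `describe-vpc-endpoints` query above against the required set. The sketch below assumes the rows have the `{Service, State}` shape produced by that `--query`; `missingEndpoints` and the `Features` flags are hypothetical names introduced for illustration.

```ts
// Endpoint suffixes every private-subnet task in the table needs.
const ALWAYS_REQUIRED = ["ecr.api", "ecr.dkr", "s3", "logs"];

// One row of the `--query 'VpcEndpoints[*].{Service:ServiceName,State:State}'` output.
interface EndpointRow {
  Service: string;
  State: string;
}

// Optional features from the table's "if used" rows (hypothetical flags).
interface Features {
  usesSecretsManager: boolean;
  usesEcsExec: boolean;
}

// Returns the endpoint suffixes that are required but not available.
function missingEndpoints(rows: EndpointRow[], region: string, features: Features): string[] {
  const required = ALWAYS_REQUIRED.slice();
  if (features.usesSecretsManager) required.push("secretsmanager");
  if (features.usesEcsExec) required.push("ssmmessages");

  // Only endpoints that finished provisioning count as present.
  const available = rows
    .filter((r) => r.State.toLowerCase() === "available")
    .map((r) => r.Service);

  return required.filter(
    (suffix) => available.indexOf("com.amazonaws." + region + "." + suffix) === -1
  );
}
```

An empty result means the endpoint set is complete. It says nothing about the endpoints' security groups, which still need to allow 443 from the task's SG.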

**② The execution role has no ECR permission**

`executionRoleArn` doesn't have `AmazonECSTaskExecutionRolePolicy` attached, or a custom policy is insufficient.

```json
{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}
```

> Note: `ecr:GetAuthorizationToken` doesn't function unless the resource is `*`.

**③ Wrong tag/digest**

The specified tag doesn't exist in ECR, or the digest has changed.

```bash
# List the tags in the ECR repository
aws ecr describe-images \
  --repository-name web-api \
  --query 'imageDetails[*].{Tags:imageTags,Pushed:imagePushedAt}' \
  --output table
```

**④ Docker Hub rate limit**

If you pull a Docker Hub official image (like `node:20`) directly, you may hit the anonymous-pull rate limit. Pull the image from ECR Public instead, or mirror it into your own ECR repository and pull from there.

### Fix flow

```text
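1. Check whether this is a private-subnet configuration
   YES → add the VPC endpoints (ecr.api / ecr.dkr / s3)
         or check the NAT Gateway

2. Check the execution role's policy
   → is AmazonECSTaskExecutionRolePolicy attached?

3. Check the image URI
   → confirm with describe-images that the tag exists in ECR

4. Check the SGs
   → does the VPC endpoint's SG allow 443 from the task's SG?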
```

### Prevention

Manage VPC endpoints as code with Terraform so that a missing endpoint or route shows up in review at the `terraform plan` stage. Push to ECR from the CI pipeline, and confirm that the tag exists right after the push before triggering the deploy.

---

## ResourceInitializationError: resource-initialization failure

### Symptom

An error in the early phase of task startup. `stoppedReason` has a message like the below.

```text
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: ...
```

Or:

```text
ResourceInitializationError: failed to configure ENI: ...
```

### Cause and diagnosis

ResourceInitializationError is **close to being the parent category of CannotPullContainerError.** It can occur anywhere in the pre-startup initialization phase, where the Fargate agent sets up networking, retrieves secrets, and initializes logging.

**Main causes:**

| Cause | Checkpoint |
|------|--------------|
| Secret-retrieval failure (Secrets Manager / SSM unreachable) | Presence of the VPC endpoints `secretsmanager` / `ssm` |
| The execution role lacks `secretsmanager:GetSecretValue` / `ssm:GetParameters` | Check the execution role's policy |
| Log-init failure (CloudWatch Logs unreachable) | Presence of the VPC endpoint `logs`, `logs:CreateLogStream` on the role |
| ENI-allocation failure (subnet IP exhaustion) | Confirm the subnet's free IP count with `describe-subnets` |
| The SG blocks outbound | Check the task SG's outbound rules |

```bash
# Check the subnet's free IP addresses
aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query 'Subnets[*].{CIDR:CidrBlock,Available:AvailableIpAddressCount}'
```

### Confusing the execution role and the task role generates chain errors

This is the most overlooked pattern. The [ECS on Fargate production-operations guide](/blog/aws-ecs-fargate-production-guide) covers it in detail, but here is a recap.

| Role | Config key | Used by | Typical permissions |
|--------|---------|---------|------------|
| **Execution role** | `executionRoleArn` | The ECS agent (at startup) | ECR pull, CloudWatch Logs write, Secrets Manager retrieval |
| **Task role** | `taskRoleArn` | App code (during execution) | App-specific AWS resources like S3, DynamoDB, SQS |

Secret injection (`secrets[].valueFrom`) is performed by the agent at startup, so the permission belongs on the **execution role.** The permission belongs on the task role only when the app itself calls `aws secretsmanager get-secret-value` at runtime. Putting the permission on the wrong role shows up as ResourceInitializationError.
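
One way to keep the two roles apart is to derive, from the task definition itself, which permissions the agent needs at startup. The sketch below is an assumption-laden illustration: `requiredExecutionRoleActions` and `missingExecutionRole` are hypothetical helpers, the interfaces cover only the fields used here, and the ARN-prefix test assumes the standard `aws` partition.

```ts
interface ContainerDefinition {
  name: string;
  image: string;
  secrets?: Array<{ name: string; valueFrom: string }>;
  logConfiguration?: { logDriver: string };
}

interface TaskDefinition {
  executionRoleArn?: string;
  taskRoleArn?: string;
  containerDefinitions: ContainerDefinition[];
}

const ECR_PULL_ACTIONS = [
  "ecr:GetAuthorizationToken",
  "ecr:BatchCheckLayerAvailability",
  "ecr:GetDownloadUrlForLayer",
  "ecr:BatchGetImage",
];

// Returns the IAM actions the agent uses before the app starts, i.e. the
// ones that belong on executionRoleArn rather than taskRoleArn.
function requiredExecutionRoleActions(td: TaskDefinition): string[] {
  const actions: string[] = [];
  const add = (action: string): void => {
    if (actions.indexOf(action) === -1) actions.push(action);
  };

  for (const c of td.containerDefinitions) {
    // Images hosted in a private ECR registry need the pull actions.
    if (/\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com\//.test(c.image)) {
      ECR_PULL_ACTIONS.forEach(add);
    }
    // The awslogs driver writes through the execution role.
    if (c.logConfiguration && c.logConfiguration.logDriver === "awslogs") {
      add("logs:CreateLogStream");
      add("logs:PutLogEvents");
    }
    // Injected secrets: Secrets Manager ARNs vs. Parameter Store references.
    for (const s of c.secrets || []) {
      const isSecretsManager = s.valueFrom.indexOf("arn:aws:secretsmanager:") === 0;
      add(isSecretsManager ? "secretsmanager:GetSecretValue" : "ssm:GetParameters");
    }
  }
  return actions;
}

// True when the agent needs permissions but no execution role is configured.
function missingExecutionRole(td: TaskDefinition): boolean {
  return requiredExecutionRoleActions(td).length > 0 && !td.executionRoleArn;
}
```

Anything the app calls at runtime (S3, DynamoDB, SQS) is deliberately absent from this list, because it belongs on the task role.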

---

## OutOfMemoryError: memory-limit overrun

### Symptom

The task's `stoppedReason` has:

```text
OutOfMemoryError: Container killed due to memory usage
```

or `containers[].exitCode = 137` (128 + 9: the process received SIGKILL, which is what the Linux OOM killer sends).

### Cause and diagnosis

On Fargate, memory is sized at the task level and can additionally be capped per container. When a container's usage exceeds its hard limit (the container-level `memory`), or the task as a whole exceeds the task-level memory, the Linux kernel's OOM killer force-terminates the process.

```bash
# Check memory usage in Container Insights (example query)
# `date -v-1H` is the BSD/macOS form; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=web-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average Maximum

**Sorting the cause:**

| Pattern | How to tell | Fix |
|---------|---------|------|
| Underestimated task size | Constant OOMKill. The max reaches the limit | Change the task size to the next pair up |
| Memory leak | Normal right after startup but gradually grows and dies | Investigate with a heap dump/profiler |
| Momentary overrun at spikes | Only in specific time bands or under high load | Set soft (memoryReservation) and hard (memory) limits separately |

The task definition's memory settings have two levels.

```json
{
  "name": "app",
  "memory": 1024,
  "memoryReservation": 512
}
```

- `memory` (hard limit): exceeding this is OOMKill
- `memoryReservation` (soft limit): a guideline for scheduling. Exceeding it isn't an immediate kill

For spike tolerance, set `memoryReservation` a little above normal usage and `memory` to about 1.5–2× that. But **the task-level memory is the ceiling for the whole task on Fargate: if the containers' settings add up to more than it, the task can't start.**
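
As a worked example of that sizing rule, the sketch below derives both limits from an observed normal usage and then checks them against the task size. The 10% headroom and the 150% spike factor are assumptions chosen for illustration (the text only says "a little above" and "1.5–2×"), and `planContainerMemory` / `fitsTask` are hypothetical helpers. The fit check encodes one reading of the ceiling rule: reservations are summed, and no single hard limit may exceed the task size.

```ts
interface MemoryPlan {
  memoryReservation: number; // soft limit, MiB
  memory: number;            // hard limit, MiB
}

// Percentages are integers so the arithmetic stays exact.
function planContainerMemory(
  normalUsageMiB: number,
  headroomPct: number = 10,
  spikePct: number = 150
): MemoryPlan {
  const memoryReservation = Math.ceil((normalUsageMiB * (100 + headroomPct)) / 100);
  const memory = Math.ceil((memoryReservation * spikePct) / 100);
  return { memoryReservation, memory };
}

// Assumed rule: summed reservations and each hard limit must fit the task size.
function fitsTask(plans: MemoryPlan[], taskMemoryMiB: number): boolean {
  const reserved = plans.reduce((sum, p) => sum + p.memoryReservation, 0);
  return reserved <= taskMemoryMiB && plans.every((p) => p.memory <= taskMemoryMiB);
}
```

With a normal usage of 400 MiB this gives a 440 MiB reservation and a 660 MiB hard limit, which fits a 1024 MiB task. Two such containers would not fit a 512 MiB task.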

---

## Essential container exited / non-zero exitCode: app-startup failure

### Symptom

```text
Essential container in task exited
```

`stopCode = EssentialContainerExited`, and `containers[].exitCode` is non-zero.

### Diagnosis procedure

**Step 1: Check CloudWatch Logs (most important)**

```bash
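# Fetch the latest events from the log stream
# awslogs stream names are <awslogs-stream-prefix>/<container-name>/<task-id>
aws logs get-log-events \
  --log-group-name /ecs/web-api \
  --log-stream-name <stream-prefix>/app/<task-id> \
  --limit 50 \
  --query 'events[*].message'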

Fargate ships the app's STDOUT/STDERR to CloudWatch Logs via the `awslogs` driver. If the app panicked or exited because "the config file can't be read," "DB connection failed," "the port is already in use," and so on, the record is almost always here.
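
When you script this lookup, the stream name has to be assembled from the task ARN. With the `awslogs` driver and a stream prefix configured, the name takes the form `<awslogs-stream-prefix>/<container-name>/<task-id>`. A minimal sketch, where `taskIdFromArn` and `awslogsStreamName` are hypothetical helper names and the `ecs` prefix in the test is only an example value:

```ts
// The task ID is the last path segment of the task ARN.
function taskIdFromArn(taskArn: string): string {
  const parts = taskArn.split("/");
  return parts[parts.length - 1];
}

// Builds the awslogs stream name: <prefix>/<container-name>/<task-id>.
function awslogsStreamName(streamPrefix: string, containerName: string, taskArn: string): string {
  return [streamPrefix, containerName, taskIdFromArn(taskArn)].join("/");
}
```

The result is what you pass to `--log-stream-name`. The prefix is whatever `awslogs-stream-prefix` is set to in the container's log configuration.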

**Step 2: Check for missing environment variables / secrets**

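A wrong ARN or a missing version in `secrets[].valueFrom` normally fails earlier, as ResourceInitializationError. What reaches this stage is a variable that was never defined in the task definition, or a secret that exists but holds an unexpected value, so the app starts without a setting it needs and crashes.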

```bash
# Confirm that the secret exists
aws secretsmanager describe-secret \
  --secret-id arn:aws:secretsmanager:ap-northeast-1:111122223333:secret:prod/db-Ab12Cd
```

**Step 3: Check CMD / ENTRYPOINT**

The task definition's `command` may be wrong or may reference a path that doesn't exist in the image.

```bash
# Run the image locally to reproduce the failure
docker run --rm \
  -e DATABASE_URL=<test-url> \
  111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/web-api:latest
```

**A list of common exitCodes:**

| exitCode | Meaning | Direction of the fix |
|---------|------|----------|
| 1 | A general app error | Check the error content in CloudWatch Logs |
| 2 | Misuse of Bash / a shell error | Check CMD/ENTRYPOINT syntax |
| 127 | Command not found | A path / binary-name error |
| 137 | SIGKILL (OOMKill or manual kill) | Memory settings or a StopTask operation |
| 143 | SIGTERM (graceful shutdown accepted normally) | Normal stop. Confirm it's an intentional stop |
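
Exit codes above 128 follow the shell convention `128 + signal number`, which is why 137 and 143 decode to SIGKILL (9) and SIGTERM (15). A small decoder makes that arithmetic explicit. This is a sketch: `describeExitCode` is a hypothetical helper, and the signal table lists only a few common Linux signals.

```ts
// A few common Linux signal numbers; extend as needed.
const SIGNAL_NAMES: { [signal: number]: string } = {
  2: "SIGINT",
  6: "SIGABRT",
  9: "SIGKILL",
  11: "SIGSEGV",
  15: "SIGTERM",
};

function describeExitCode(exitCode: number | null): string {
  if (exitCode === null) return "no exit code: the container process never started";
  if (exitCode === 0) return "clean exit";
  if (exitCode > 128) {
    // Shell convention: 128 + N means "terminated by signal N".
    const signal = exitCode - 128;
    const name = SIGNAL_NAMES[signal] || "unknown signal";
    return "terminated by signal " + signal + " (" + name + ")";
  }
  if (exitCode === 127) return "command not found";
  if (exitCode === 126) return "command found but not executable";
  if (exitCode === 2) return "possible shell usage error: check CMD/ENTRYPOINT syntax";
  return "application error: check CloudWatch Logs";
}
```

A code of 137 only tells you the process was killed with SIGKILL. Whether that was the OOM killer or a forced stop after `stopTimeout` still has to be read from `stoppedReason` and the memory metrics.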

---

## Task failed ELB health checks: failure to pass the health check

### Symptom

The task is up but the service is judged unhealthy, and tasks repeatedly swap out. The ECS service's events show the following.

```text
service web-api (port 8080) is unhealthy in target-group arn:... due to (reason Health checks failed)
```

### Diagnosis flow

**① Confirm target_type = "ip" (a Fargate-specific pitfall)**

If the ALB's target group is left as `target_type = "instance"`, Fargate tasks can't be registered correctly.

```bash
aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].{Name:TargetGroupName,TargetType:TargetType}'
```

Always confirm it's `ip`. If it's `instance`, recreate the target group, because the target type can't be changed after creation.

**② Confirm the health-check path and port**

Is the health-check path configured on the ALB's target group (e.g. `/healthz`) implemented in the app, and does it respond on the correct port?

```bash
# Get the task's ENI IP and run the health check manually
TASK_ENI_IP=$(aws ecs describe-tasks \
  --cluster prod --tasks <task-id> \
  --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
  --output text)

# Check from a bastion inside the VPC (if it can reach the task directly)
curl -v http://${TASK_ENI_IP}:8080/healthz
```

**③ Is the SG blocking the health check**

Health checks are sent from the ALB's own nodes, so the task's SG must allow inbound traffic from **the ALB's SG on the health-check port.** An "allow only the ALB's SG" rule still fails if it references a different SG or a different port.

```bash
# Check the task SG's inbound rules
aws ec2 describe-security-groups \
  --group-ids <task-sg-id> \
  --query 'SecurityGroups[*].IpPermissions'
```

**④ The startPeriod and grace-period settings**

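An app that takes time to start (running DB migrations, loading a large model, etc.) is judged unhealthy and killed during initialization unless you configure a grace period before health checks count. In the container health check, `command` is required; the `curl` call here is an assumption and only works if the image ships `curl`.

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
    "startPeriod": 60,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}
```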

In addition, the ECS-service side also needs `health_check_grace_period_seconds`.

```hcl
resource "aws_ecs_service" "app" {
  # ...
  health_check_grace_period_seconds = 60
}
```

`healthCheck.startPeriod` (the grace period for the container health check in the task definition) and `health_check_grace_period_seconds` (the period during which the ECS service ignores ALB health-check failures) are **different things.** The grace period after registration with the ALB is controlled by the latter.
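
To size the two settings against a measured startup time, a rough budget check helps. The sketch below assumes a simplified model: failures inside `startPeriod` don't count, and after it `retries` consecutive failures spaced `interval` seconds apart mark the container unhealthy. It ignores `timeout` and scheduling jitter, so treat the result as an approximation. `approxSecondsUntilUnhealthy` and `graceFindings` are hypothetical helpers.

```ts
interface ContainerHealthCheck {
  startPeriod: number; // seconds
  interval: number;    // seconds
  timeout: number;     // seconds (not used in this rough estimate)
  retries: number;
}

// Approximate time a never-healthy container survives before being marked unhealthy.
function approxSecondsUntilUnhealthy(hc: ContainerHealthCheck): number {
  return hc.startPeriod + hc.retries * hc.interval;
}

// Compares a measured app startup time against both grace settings.
function graceFindings(
  appStartupSeconds: number,
  hc: ContainerHealthCheck,
  serviceGraceSeconds: number
): string[] {
  const findings: string[] = [];
  if (appStartupSeconds >= approxSecondsUntilUnhealthy(hc)) {
    findings.push("container health check may fail before the app is up: raise startPeriod");
  }
  if (appStartupSeconds > serviceGraceSeconds) {
    findings.push("ALB failures may count before the app is up: raise health_check_grace_period_seconds");
  }
  return findings;
}
```

With the values shown above (`startPeriod` 60, `interval` 15, `retries` 3) the container has roughly 105 seconds. An app that needs 90 seconds passes the container check but still needs a service grace period longer than 60 seconds.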

**Health-check diagnosis checklist:**

- [ ] Is `target_type = "ip"`?
- [ ] Does the health-check path return `200` in the app?
- [ ] Does the health-check port number match the container port?
- [ ] Does the task SG allow inbound from the ALB's SG?
- [ ] Is the task definition's `healthCheck.startPeriod` set?
- [ ] Is the ECS service's `health_check_grace_period_seconds` set?

---

## SpotInterruption: a Spot interruption is not a "failure"

### Symptom

```text
stopCode: "SpotInterruption"
stoppedReason: "Your Spot Task was interrupted."
```

### The correct understanding

A Spot interruption is a **designed behavior** for AWS to reclaim capacity, not a software bug or an operational mistake. What you should address is "is it designed not to break when an interruption occurs."

**How a Spot interruption works:**

1. AWS decides to reclaim capacity and issues a warning 2 minutes before the task is stopped
2. EventBridge fires an `ECS Task State Change` event (stopCode=SpotInterruption)
3. **SIGTERM** is sent to the task (with the `stopTimeout` grace, max 120 seconds)
4. SIGKILL after the grace

**The 3-piece set to absorb it by design:**

```text
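① Graceful shutdown: on SIGTERM, finish all in-flight work
② Idempotent processing: even if killed midway, nothing is processed twice after a restart
③ Capacity provider strategy: protect the base with On-Demand and put the additional tasks on Spot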
```
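
Item ② can be sketched as deduplication by an idempotency key. This is an illustration only, not the implementation used on the payment platform: `ProcessedStore`, `InMemoryStore`, and `processOnce` are hypothetical names, and the in-memory store stands in for a durable one. In a real system, the side effect and the "done" record must be committed atomically (for example in one database transaction or with a conditional write), otherwise a kill between the two steps can still cause a repeat.

```ts
// Records which idempotency keys have already been handled.
interface ProcessedStore {
  has(key: string): boolean;
  add(key: string): void;
}

// Stand-in for a durable store; a restart would lose this state.
class InMemoryStore implements ProcessedStore {
  private keys: { [key: string]: true } = {};

  has(key: string): boolean {
    return Object.prototype.hasOwnProperty.call(this.keys, key);
  }

  add(key: string): void {
    this.keys[key] = true;
  }
}

// Runs the handler at most once per key, as long as the store survives restarts.
function processOnce<T>(
  store: ProcessedStore,
  key: string,
  job: T,
  handler: (job: T) => void
): "processed" | "skipped" {
  if (store.has(key)) return "skipped";
  handler(job);
  // Recorded only after success, so a job killed mid-handler is retried.
  store.add(key);
  return "processed";
}
```

A worker that is interrupted and restarted re-reads the same message, calls `processOnce` with the same key, and gets `"skipped"` if the earlier attempt completed.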

```hcl
# Detect Spot interruptions with EventBridge and route them to alerts
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name          = "ecs-spot-interruption"
  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Task State Change"]
    detail = {
      stopCode = ["SpotInterruption"]
    }
  })
}
```

For the graceful-shutdown implementation pattern, see the "end cleanly on SIGTERM" section of the [ECS on Fargate production-operations guide](/blog/aws-ecs-fargate-production-guide). `stopTimeout`'s default is 30 seconds and the max is 120 seconds. In production, set it explicitly to match the app's drain time.

---

## CannotStartContainerError / ContainerRuntimeTimeoutError

### Symptom

```text
CannotStartContainerError: ...
ContainerRuntimeTimeoutError: Timeout waiting for container to start
```

exitCode is null (hasn't reached container-process startup).

### Cause and diagnosis

**CannotStartContainerError:**

- A non-executable binary specified in `command` / `entryPoint`
- Can't write to the `volumes` mount target (especially with `readonlyRootFilesystem: true`)
- A UID specified by `user` doesn't exist in the container
- A `linuxParameters` setting that Fargate doesn't support

```bash
# Reproduce the same settings locally
docker run --rm \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp \
  my-image:tag
```

**ContainerRuntimeTimeoutError:**

The container's startup sequence (getting ENTRYPOINT/CMD running) timed out, for example because it waits too long for a connection to a dependency service. If initialization blocks on an external service, consider extending the health check's `startPeriod` or decoupling external dependencies from the startup sequence.

---

## ECS Exec: investigate inside a running container

When a task is running (or while debugging right after startup), **ECS Exec** lets you get a shell directly inside the container to investigate. It needs no SSH, no open ports, and no key management.

### Prerequisites

**① `enableExecuteCommand: true` on the ECS service / task**

```hcl
resource "aws_ecs_service" "app" {
  enable_execute_command = true
  # ...
}
```

Note: `enableExecuteCommand` is **effective only for newly-started tasks.** It can't be retrofitted to existing tasks. After changing the setting, swap tasks with force-new-deployment.

**② The 4 ssmmessages actions on the task role**

```json
{
  "Effect": "Allow",
  "Action": [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel"
  ],
  "Resource": "*"
}
```

This is a permission to put on the **task role** (`taskRoleArn`), not the execution role.

**③ The SSM Session Manager plugin (client side)**

```bash
# macOS
brew install session-manager-plugin
```

### The actual commands

```bash
# Look up the task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster prod \
  --service-name web-api \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

# Open a shell in the container
aws ecs execute-command \
  --cluster prod \
  --task ${TASK_ID} \
  --container app \
  --interactive \
  --command "/bin/sh"
```

Once in the shell, you can directly check environment variables, file existence, network connectivity, and more.

```bash
# Check environment variables inside the container
env | grep DATABASE

# Check connectivity to the DB
nc -zv db.internal 5432

# Check running processes
ps aux
```

### Auditing ECS Exec

All ECS Exec operations are **recorded in CloudTrail** (as `ExecuteCommand` API calls). In addition, setting `logging: OVERRIDE` in the cluster's `executeCommandConfiguration` lets you save each session's input/output to CloudWatch Logs or S3. For access to production containers, I recommend always enabling this audit log.

For a deeper observability implementation, see [Observability with OpenTelemetry × ECS](/blog/aws-observability-opentelemetry-sre-ecs).

---

## Prevention: stop recurrence with observability

Once you've solved a problem, put in structural prevention so you don't spend time on the same problem again.

### 1. Structured logs + correlation ID

Output all logs as JSON and always include `requestId` / `traceId`. You can then filter in CloudWatch Logs Insights.

```ts
// A minimal structured-logging implementation
const log = (level: string, msg: string, ctx: Record<string, unknown> = {}) => {
  process.stdout.write(
    JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...ctx }) + "\n"
  );
};

// Example call; per-request logs would also pass requestId / traceId in ctx
log("info", "server:start", { port: 8080, env: process.env.NODE_ENV });
```

### 2. Container Insights + alerts

```hcl
resource "aws_cloudwatch_metric_alarm" "task_stopped" {
  alarm_name          = "ecs-task-stopped-abnormally"
  comparison_operator = "LessThanThreshold" # fire when the running count drops below 1
  evaluation_periods  = 1
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 60
  statistic           = "Minimum"
  threshold           = 1
  treat_missing_data  = "breaching" # assumption: no datapoints also means no running tasks
  alarm_description   = "All tasks have stopped (even though desired > 0)"
  dimensions = {
    ClusterName = "prod"
    ServiceName = "web-api"
  }
}
```

### 3. Deployment circuit breaker

When new tasks repeatedly fail to reach a steady state (for example, by failing health checks), the deployment automatically rolls back to the previous revision.

```hcl
resource "aws_ecs_service" "app" {
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}
```

This alone structurally prevents the accident of "a misdeploy stays stuck in production." Integration with the CI/CD pipeline is detailed in the [ECS on Fargate CI/CD guide](/blog/aws-ecs-fargate-cicd-blue-green-codedeploy-github-actions-guide).

### 4. Always set the health check's startPeriod

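```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
    "startPeriod": 30,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}
```

Without `startPeriod`, the default is 0 seconds, so failures during a temporarily unhealthy startup count immediately and the task gets replaced. The `command` shown assumes the image ships `curl`.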

### 5. Codify the network design with Terraform

A lack of VPC endpoints is the most common cause of CannotPullContainer / ResourceInitializationError. Manage the patterns from the [ECS on Fargate networking guide](/blog/aws-ecs-fargate-networking-alb-service-connect-vpc-guide) with Terraform and eliminate config differences between environments.

---

## Quick reference: stop reason → where to look first → typical cause → fix

| Stop reason / stopCode | Where to look first | Typical cause | Fix |
|--------------------|----------------|---------|------|
| `CannotPullContainerError` | VPC endpoint list, execution role | Missing NAT/VPC endpoint in a private subnet, insufficient ECR permission on the execution role | Add ECR endpoints or check NAT, grant the policy |
| `ResourceInitializationError` | Execution role, VPC endpoints (secretsmanager/logs/ssm) | Secret unreachable, log-write impossible, ENI-allocation failure | Add permissions to the execution role, add VPC endpoints, check subnet IP exhaustion |
| `EssentialContainerExited` (exitCode≠0) | CloudWatch Logs | App-startup failure, missing env/secrets, wrong CMD | Check the error content in logs, fix env vars and command |
| `EssentialContainerExited` (exitCode=137) | Container Insights memory graph | OOMKill (memory-limit overrun) | Change task memory to the next pair up, investigate a leak |
| `EssentialContainerExited` (exitCode=null) | describe-tasks' containers[].reason | CannotStartContainer (pre-startup failure) | Check the CMD/ENTRYPOINT path and permissions |
| ELB health-check failure | ALB target-group config, task SG | target_type=instance, SG misconfig, insufficient startPeriod | Change to `target_type=ip`, add SG allow, set startPeriod |
| `SpotInterruption` | The EventBridge event | Spot capacity reclamation (normal behavior) | Implement graceful shutdown, confirm idempotent design |
| `ContainerRuntimeTimeoutError` | The task definition's CMD/ENTRYPOINT | Startup processing too long, wait for a dependency service | Optimize the startup sequence, extend startPeriod |
| `CannotCreateVolumeError` | The task definition's volumes, EFS config | EFS mount target unreachable, SG config | Check the EFS mount targets and SG |

---

## Summary

ECS on Fargate troubleshooting is fastest with the pattern of logically narrowing down in the order **`stoppedReason` → `stopCode` → `exitCode` → CloudWatch Logs**, rather than starting from a feeling of "something's off."

In my experience, the vast majority of "task won't start" problems come down to these 3.

1. **Network reachability** (a missing VPC endpoint or NAT in a private subnet)
2. **Confusing the execution role and the task role** (a permission on the wrong role)
3. **Insufficient health-check grace** (startPeriod and grace period not set)

I was able to achieve zero double charges on the payment platform because I consistently designed to **eliminate problems in advance with structure and observability**, rather than fixing them after a failure occurs. If you want to stabilize a Fargate production environment quickly and safely, feel free to reach out.
