友田 陽大
ECS on Fargate in production
AWS
ECS
Fargate
Troubleshooting
Observability
Debugging
Containers
Operations

ECS on Fargate Troubleshooting Complete Guide: Diagnosing and Fixing Why Tasks Won't Start or Die Immediately, by Stop-Reason Code

A practical guide to systematically diagnosing and fixing ECS Fargate task stop reasons (CannotPullContainerError, OutOfMemory, health-check failure, etc.) by stop code, from how to read describe-tasks.

Reading time: 17 min read
Author: 友田 陽大

"After deploying, the task won't start," "it starts but dies immediately," "it never passes the ALB health check": anyone operating ECS on Fargate runs into these at least once.

On a lumber-distribution B2B SaaS, I ran 221 endpoints behind API Gateway → NLB → ALB → ECS on Fargate, and ran the worker fleet of a payment platform (zero double charges in production) on the same infrastructure. Task-stop problems are solved not by "being careful in operations" but by a diagnostic pattern plus structural prevention. Rather than staring at CloudWatch with a vague sense that something is off, the shortest route is to narrow down the cause logically from stoppedReason and stopCode.

This article organizes ECS Fargate task stop reasons by category and walks through each one end to end: where to look → what the cause is → how to fix it → how to prevent recurrence. For the overall picture of production design, read the ECS on Fargate production-operations guide first.


Where to look first: the starting point of diagnosis

From the moment you notice "a task died," there are 4 places to check.

| Place to check | What it tells you |
| --- | --- |
| The console's "Stopped tasks" tab | A summary of stoppedReason, exitCode, and the final status |
| aws ecs describe-tasks | Structured data for every field (the most detailed view) |
| CloudWatch Logs (awslogs) | Logs the app wrote itself (panics, startup failures, etc.) |
| The EventBridge "ECS Task State Change" event | Asynchronous stop notification; the starting point for alerting and automated response |

The first move is always this command:

aws ecs describe-tasks \
  --cluster prod \
  --tasks <task-id> \
  --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,stopCode:stopCode,containers:containers[*].{name:name,reason:reason,exitCode:exitCode,lastStatus:lastStatus}}' \
  --output json

How to read describe-tasks: always confirm 5 fields

{
  "lastStatus": "STOPPED",
  "stoppedReason": "Essential container in task exited",
  "stopCode": "EssentialContainerExited",
  "containers": [
    {
      "name": "app",
      "lastStatus": "STOPPED",
      "exitCode": 1,
      "reason": ""
    }
  ]
}

| Field | Meaning | Point to note |
| --- | --- | --- |
| lastStatus | The task's current overall state | Confirm it is STOPPED |
| stoppedReason | Human-readable stop-reason text | The category is readable from it, e.g. "CannotPullContainerError" or "Your Spot Task was interrupted." |
| stopCode | A machine-readable stop code | Used for EventBridge filters and alerts |
| containers[].exitCode | The container process's exit code | 0=normal, 1=app error, 137=SIGKILL (often an OOM kill), null=never reached startup |
| containers[].reason | The per-container reason | Details for pull errors etc. appear here |

When exitCode is null, the failure happened on the Fargate-agent side before the app even started (CannotPullContainer, ResourceInitializationError, etc.). Eliminate the agent-side cause first; the app's logs have nothing to show yet.


Diagnostic flowchart

A task became STOPPED
        |
        +-- "CannotPullContainer" in stoppedReason ?
        |       YES → go to § CannotPullContainerError
        |
        +-- "ResourceInitializationError" in stoppedReason ?
        |       YES → go to § ResourceInitializationError
        |
        +-- stopCode = "EssentialContainerExited" ?
        |       YES → check containers[].exitCode
        |               exitCode=137 → go to § OutOfMemoryError
        |               exitCode≠0, non-null → go to § Essential container exited
        |               exitCode=null → go to § CannotStartContainerError
        |
        +-- "health check" / "ELB" in stoppedReason ?
        |       YES → go to § ELB health-check failure
        |
        +-- stopCode = "SpotInterruption" ?
        |       YES → go to § Spot interruption (not a failure)
        |
        +-- "timeout" in stoppedReason ?
                YES → go to § ContainerRuntimeTimeoutError
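
The flowchart above can be sketched as a small function, which is handy for tagging alerts. This is a minimal sketch, not an official mapping: classifyStop and the StoppedTask shape are hypothetical names, the input is assumed to be the object returned by the describe-tasks query shown earlier, and the returned strings are simply the section titles of this guide.

```typescript
// Fields selected by the describe-tasks query earlier in this article.
interface ContainerInfo {
  name: string;
  exitCode?: number | null;
  reason?: string | null;
}

interface StoppedTask {
  stoppedReason?: string | null;
  stopCode?: string | null;
  containers?: ContainerInfo[];
}

// Hypothetical helper: returns the section of this guide to read next.
function classifyStop(task: StoppedTask): string {
  const reason = task.stoppedReason ?? "";
  const lower = reason.toLowerCase();

  if (reason.indexOf("CannotPullContainer") >= 0) return "CannotPullContainerError";
  if (reason.indexOf("ResourceInitializationError") >= 0) return "ResourceInitializationError";

  if (task.stopCode === "EssentialContainerExited") {
    const codes = (task.containers ?? []).map((c) => c.exitCode ?? null);
    if (codes.some((c) => c === 137)) return "OutOfMemoryError";
    if (codes.some((c) => c !== null && c !== 0)) return "Essential container exited";
    if (codes.some((c) => c === null)) return "CannotStartContainerError";
  }

  if (lower.indexOf("health check") >= 0 || reason.indexOf("ELB") >= 0) {
    return "ELB health-check failure";
  }
  if (task.stopCode === "SpotInterruption") return "Spot interruption";
  if (lower.indexOf("timeout") >= 0) return "ContainerRuntimeTimeoutError";
  return "unclassified";
}
```

Anything that falls through to "unclassified" is a signal to read the raw stoppedReason by hand rather than trust the sketch.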

CannotPullContainerError: the image can't be pulled

Symptom

The task stops during PROVISIONING → PENDING, and stoppedReason shows the following.

CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...

exitCode is null (hasn't reached app startup).

Per-cause checklist

① Lack of VPC endpoints / NAT (most common)

The task runs in a private subnet with assignPublicIp=DISABLED, yet none of the following exists.

  • A NAT Gateway (the outbound route)
  • VPC endpoints for ECR (com.amazonaws.<region>.ecr.api + com.amazonaws.<region>.ecr.dkr)
  • An S3 gateway endpoint (mandatory because ECR's layers are stored in S3)

# List the VPC endpoints
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=<vpc-id>" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'

The minimum set needed (private-subnet configuration):

| Endpoint | Type | Use |
| --- | --- | --- |
| ecr.api | Interface | ECR API calls (e.g. fetching the authorization token) |
| ecr.dkr | Interface | Pulling manifests and layers (Docker Registry API) |
| s3 | Gateway | The actual ECR layer data (stored in S3) |
| logs | Interface | The awslogs driver (CloudWatch Logs) |
| secretsmanager | Interface | secrets valueFrom (if used) |
| ssmmessages | Interface | ECS Exec (if used) |
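
To turn the table above into a repeatable check, the ServiceName values printed by describe-vpc-endpoints can be compared against the required set. The following is a minimal sketch under assumptions: missingEndpoints and EndpointNeeds are hypothetical names, the input list is assumed to contain only endpoints whose State is available, and a VPC that reaches these services through a NAT Gateway instead will legitimately report everything as missing.

```typescript
interface EndpointNeeds {
  usesSecretsManager: boolean; // the task definition uses secrets[].valueFrom
  usesEcsExec: boolean;        // ECS Exec is enabled
}

// Returns the short service names from the table that are absent from `present`.
// `present` holds full names such as "com.amazonaws.ap-northeast-1.ecr.api".
function missingEndpoints(region: string, present: string[], needs: EndpointNeeds): string[] {
  const required = ["ecr.api", "ecr.dkr", "s3", "logs"];
  if (needs.usesSecretsManager) required.push("secretsmanager");
  if (needs.usesEcsExec) required.push("ssmmessages");

  return required.filter((svc) => present.indexOf(`com.amazonaws.${region}.${svc}`) < 0);
}
```

For example, a VPC that only has the ecr.api and s3 endpoints, running a task that injects secrets, would report ecr.dkr, logs, and secretsmanager as missing.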

② The execution role has no ECR permission

The role set in executionRoleArn doesn't have the AmazonECSTaskExecutionRolePolicy managed policy attached, or its custom policy is missing some of the following actions.

{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}

Note: ecr:GetAuthorizationToken does not support resource-level permissions, so it only works with "Resource": "*".
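
When the execution role uses a custom policy, a quick way to spot a gap is to diff its statements against the four actions listed above. This is a rough sketch, not an IAM evaluator: missingEcrActions and patternCovers are hypothetical helpers, only exact names and a trailing * wildcard are understood, and Deny statements, Resource scoping, and conditions are ignored.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

const REQUIRED_ECR_ACTIONS = [
  "ecr:GetAuthorizationToken",
  "ecr:BatchCheckLayerAvailability",
  "ecr:GetDownloadUrlForLayer",
  "ecr:BatchGetImage",
];

// Simplified matcher: exact match or a trailing "*" (e.g. "ecr:*", "ecr:Get*").
function patternCovers(pattern: string, action: string): boolean {
  const p = pattern.toLowerCase();
  const a = action.toLowerCase();
  if (p.charAt(p.length - 1) === "*") return a.indexOf(p.slice(0, -1)) === 0;
  return p === a;
}

// Lists the required ECR actions that no Allow statement covers.
function missingEcrActions(statements: PolicyStatement[]): string[] {
  const allowed: string[] = [];
  for (const s of statements) {
    if (s.Effect !== "Allow") continue;
    const actions = Array.isArray(s.Action) ? s.Action : [s.Action];
    for (const a of actions) allowed.push(a);
  }
  return REQUIRED_ECR_ACTIONS.filter((req) => !allowed.some((p) => patternCovers(p, req)));
}
```

An empty result only means the actions are named somewhere in an Allow statement; the IAM policy simulator remains the authoritative check.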

③ Wrong tag/digest

The specified tag doesn't exist in ECR, or the digest has changed.

# List the tags in ECR
aws ecr describe-images \
  --repository-name web-api \
  --query 'imageDetails[*].{Tags:imageTags,Pushed:imagePushedAt}' \
  --output table

④ Docker Hub rate limit

If you pull a Docker Hub official image (like node:20) directly, you may hit the anonymous-pull rate limit. Mirror the image to ECR (or use its ECR Public copy) and pull from there instead.

Fix flow

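1. Check whether this is a private-subnet configuration
   YES → add the VPC endpoints (ecr.api / ecr.dkr / s3)
         or check the NAT Gateway

2. Check the execution role's policy
   → is AmazonECSTaskExecutionRolePolicy attached?

3. Check the image URI
   → confirm with describe-images that the tag exists in ECR

4. Check the SGs
   → does the VPC endpoint's SG allow 443 from the task's SG?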

Prevention

Manage VPC endpoints as code with Terraform so that a missing route shows up at the plan stage. Push to ECR from the CI pipeline, and confirm the tag exists right after the push, before triggering the deploy.


ResourceInitializationError: resource-initialization failure

Symptom

The task fails early in startup, and stoppedReason contains a message like the following.

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: ...

Or:

ResourceInitializationError: failed to configure ENI: ...

Cause and diagnosis

ResourceInitializationError is close to being the parent category of CannotPullContainer. It covers the whole pre-startup initialization phase, in which the Fargate agent configures networking, retrieves secrets, and initializes logging.

Main causes:

| Cause | Checkpoint |
| --- | --- |
| Secret-retrieval failure (Secrets Manager / SSM unreachable) | Presence of the secretsmanager / ssm VPC endpoints |
| The execution role has no GetSecretValue / GetParameter permission | Check the execution role's policy |
| Log-init failure (CloudWatch Logs unreachable) | Presence of the logs VPC endpoint; logs:CreateLogStream on the role |
| ENI-allocation failure (subnet IP exhaustion) | Confirm the subnet's free IP count with describe-subnets |
| The SG blocks outbound traffic | Check the task SG's outbound rules |

# Check the free IPs in the subnet
aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query 'Subnets[*].{CIDR:CidrBlock,Available:AvailableIpAddressCount}'
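
The free-IP number from describe-subnets is easier to judge against what a rolling deploy will ask for. The sketch below is only an estimate under stated assumptions: checkIpHeadroom is a hypothetical helper, each Fargate task is taken to consume one ENI (one private IP), desiredCount tasks are assumed to be already running, all tasks are assumed to land in a single subnet, and the deployment's maximumPercent is assumed to round down to a whole task.

```typescript
interface DeployIpCheck {
  peakTasks: number;   // most tasks that may exist at once during the deploy
  extraIps: number;    // additional IPs needed on top of the running tasks
  sufficient: boolean; // true when the free IPs cover the extra demand
}

function checkIpHeadroom(
  desiredCount: number,
  maximumPercent: number,
  availableIps: number,
): DeployIpCheck {
  const peakTasks = Math.floor((desiredCount * maximumPercent) / 100);
  const extraIps = Math.max(0, peakTasks - desiredCount);
  return { peakTasks, extraIps, sufficient: availableIps >= extraIps };
}
```

With 10 desired tasks and maximumPercent at 200, a deploy can ask for 10 more IPs, so a subnet reporting 6 free addresses would be expected to fail ENI allocation partway through.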

Confusing the execution role with the task role causes cascading errors

This is the most commonly overlooked pattern. The ECS on Fargate production-operations guide covers it in detail, but here is a recap.

| Role | Config key | Used by | Typical permissions |
| --- | --- | --- | --- |
| Execution role | executionRoleArn | The ECS agent (at startup) | ECR pull, CloudWatch Logs write, Secrets Manager retrieval |
| Task role | taskRoleArn | App code (at runtime) | App-specific AWS resources such as S3, DynamoDB, SQS |

Secret injection (secrets[].valueFrom) is performed by the agent at startup, so that permission belongs on the execution role. It belongs on the task role only when the app itself calls GetSecretValue at runtime. Putting the permission on the wrong role surfaces as ResourceInitializationError.


OutOfMemoryError: memory-limit overrun

Symptom

The task's stoppedReason has:

OutOfMemoryError: Container killed due to memory usage

or containers[].exitCode = 137 (128 + 9: the process was killed with SIGKILL, which is what the Linux OOM killer sends).

Cause and diagnosis

On Fargate, memory is capped at the task level (required), and optionally per container through the container definition's memory. When a container exceeds its hard limit, or the containers together exceed the task-level memory, the Linux kernel's OOM killer force-terminates the process.

# Check memory usage with Container Insights (example query)
# Note: -v-1H is BSD/macOS date syntax. With GNU date, use:
#   date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=web-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average Maximum

Narrowing down the cause:

| Pattern | How to tell | Fix |
| --- | --- | --- |
| Underestimated task size | OOM kills happen constantly; Maximum reaches the limit | Move the task size up to the next valid CPU/memory pair |
| Memory leak | Normal right after startup, then grows gradually until the task dies | Investigate with a heap dump / profiler |
| Momentary overrun at spikes | Only in specific time windows or under high load | Set the soft (memoryReservation) and hard (memory) limits separately |

The task definition's memory settings have two levels.

{
  "name": "app",
  "memory": 1024,
  "memoryReservation": 512
}

  • memory (hard limit): a container that exceeds this is OOM-killed
  • memoryReservation (soft limit): a guideline used for scheduling. Exceeding it does not cause an immediate kill

For spike tolerance, set memoryReservation slightly above normal usage and memory to about 1.5–2× that, leaving some headroom. Note, however, that the task-level memory is the overall ceiling on Fargate: if the container-level values add up to more than it, the task can't start.
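
The sizing rule above is simple arithmetic, sketched here so the numbers can be reviewed in a pull request. The helper name planContainerMemory, the 10% headroom, and the 1.5× spike factor are assumptions picked from the range mentioned above, not values mandated by ECS; percentages are passed as integers to avoid floating-point surprises.

```typescript
interface MemoryPlan {
  memoryReservation: number; // soft limit, MiB
  memory: number;            // hard limit, MiB
  capped: boolean;           // true when the hard limit had to be clamped to the task size
}

// normalMiB: typical usage observed in Container Insights.
// taskMemoryMiB: the task-level memory, which no container can exceed.
function planContainerMemory(
  normalMiB: number,
  taskMemoryMiB: number,
  headroomPercent = 110,
  spikePercent = 150,
): MemoryPlan {
  const memoryReservation = Math.ceil((normalMiB * headroomPercent) / 100);
  const wanted = Math.ceil((memoryReservation * spikePercent) / 100);
  const memory = Math.min(wanted, taskMemoryMiB);
  return { memoryReservation, memory, capped: wanted > taskMemoryMiB };
}
```

A capped result is the cue to move the task up to the next size rather than shrink the buffer.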


Essential container exited / non-zero exitCode: app-startup failure

Symptom

Essential container in task exited

stopCode = EssentialContainerExited, and containers[].exitCode is non-zero.

Diagnosis procedure

Step 1: Check CloudWatch Logs (most important)

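# Fetch the latest events from the log stream
# awslogs stream names follow <awslogs-stream-prefix>/<container-name>/<task-id>
aws logs get-log-events \
  --log-group-name /ecs/web-api \
  --log-stream-name <stream-prefix>/app/<task-id> \
  --limit 50 \
  --query 'events[*].message'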

Fargate writes the app's STDOUT/STDERR to CloudWatch Logs via the awslogs driver. If the app panicked or exited because "the config file can't be read," "the DB connection failed," "the port is already in use," and so on, the record is almost always here, provided the app got far enough to write to STDOUT/STDERR.

Step 2: Check for missing environment variables / secrets

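A secret that can't be resolved at all (a wrong ARN in secrets[].valueFrom, or a version that doesn't exist) normally fails earlier, as ResourceInitializationError. The case that shows up here is subtler: the secret resolves but holds the wrong value, or a variable the app needs was never declared in environment / secrets, so the variable is empty at startup and the app crashes.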

# Confirm the secret exists
aws secretsmanager describe-secret \
  --secret-id arn:aws:secretsmanager:ap-northeast-1:111122223333:secret:prod/db-Ab12Cd

Step 3: Check CMD / ENTRYPOINT

The task definition's command is wrong, or it references a path that doesn't exist in the image.

# Run the image locally to reproduce the failure
docker run --rm \
  -e DATABASE_URL=<test-url> \
  111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/web-api:latest

A list of common exitCodes:

| exitCode | Meaning | Direction of the fix |
| --- | --- | --- |
| 1 | General app error | Check the error in CloudWatch Logs |
| 2 | Shell misuse / shell error | Check the CMD/ENTRYPOINT syntax |
| 127 | Command not found | Wrong path or binary name |
| 137 | SIGKILL (OOM kill or manual kill) | Memory settings, or a StopTask operation |
| 143 | SIGTERM (graceful shutdown accepted normally) | Normal stop; confirm it was intentional |
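
Codes above 128 in the table follow the usual shell convention of 128 plus the signal number, which is why 137 means SIGKILL (9) and 143 means SIGTERM (15). A small decoder makes alert messages easier to read; describeExitCode is a hypothetical helper, and the wording of the returned strings is an assumption chosen for this sketch.

```typescript
const SIGNAL_NAMES: Record<number, string> = { 9: "SIGKILL", 15: "SIGTERM" };

// Turns containers[].exitCode into a short human-readable hint.
function describeExitCode(code: number | null): string {
  if (code === null) return "no exit code: the container never started";
  if (code === 0) return "clean exit";
  if (code > 128 && code < 256) {
    const signal = code - 128;
    const name = SIGNAL_NAMES[signal];
    return name
      ? `terminated by signal ${signal} (${name})`
      : `terminated by signal ${signal}`;
  }
  if (code === 127) return "command not found";
  if (code === 2) return "shell misuse or shell error";
  return "application error: check CloudWatch Logs";
}
```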

Task failed ELB health checks: failure to pass the health check

Symptom

The task is up but the service is judged unhealthy, and tasks repeatedly swap out. The ECS service's events show the following.

service web-api (port 8080) is unhealthy in target-group arn:... due to (reason Health checks failed)

Diagnosis flow

① Confirm target_type = "ip" (a Fargate-specific pitfall)

If the ALB's target group is left as target_type = "instance", Fargate tasks can't be registered correctly.

aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].{Name:TargetGroupName,TargetType:TargetType}'

Confirm that it is ip. If it is instance, recreate the target group: the target type can't be changed after creation.

② Confirm the health-check path and port

Check that the health-check path set on the ALB's target group (e.g. /healthz) is actually implemented in the app and answers on the correct port.

# Get the task's ENI IP and run the health check by hand
TASK_ENI_IP=$(aws ecs describe-tasks \
  --cluster prod --tasks <task-id> \
  --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
  --output text)

# Check from a bastion inside the VPC (if it can reach the task directly)
curl -v http://${TASK_ENI_IP}:8080/healthz

③ Is the SG blocking the health check

An inbound rule on the task's SG that says "allow only the ALB's SG" still fails if it doesn't match where the health checks actually come from: it must reference the SG attached to the ALB and cover the health-check port.

# Check the task SG's inbound rules
aws ec2 describe-security-groups \
  --group-ids <task-sg-id> \
  --query 'SecurityGroups[*].IpPermissions'

④ The startPeriod and grace-period settings

An app that takes time to start (running DB migrations, loading a large model, etc.) is judged unhealthy and killed during initialization unless you configure a grace period before health-check failures start to count.

{
  "healthCheck": {
    "startPeriod": 60,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}

In addition, the ECS-service side also needs health_check_grace_period_seconds.

resource "aws_ecs_service" "app" {
  # ...
  health_check_grace_period_seconds = 60
}

healthCheck.startPeriod (the health-check grace of the container in the task definition) and health_check_grace_period_seconds (the grace during which the ECS service ignores ALB health-check failures) are different things. The grace after being registered with the ALB is set by the latter.
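
To see how much boot time the container-level settings actually buy, the parameters can be turned into a rough time budget. This is a back-of-the-envelope sketch, not the exact scheduler behavior: the helper names are hypothetical, failures during startPeriod are assumed not to count toward retries, and check alignment and timeout are ignored.

```typescript
interface ContainerHealthCheck {
  startPeriod: number; // seconds
  interval: number;    // seconds
  timeout: number;     // seconds
  retries: number;
}

// Rough time until `retries` consecutive failures can mark the container unhealthy.
function secondsUntilUnhealthy(hc: ContainerHealthCheck): number {
  return hc.startPeriod + hc.retries * hc.interval;
}

// True when an app that needs `bootSeconds` to become healthy fits in that budget.
function bootFits(bootSeconds: number, hc: ContainerHealthCheck): boolean {
  return bootSeconds < secondsUntilUnhealthy(hc);
}
```

With the values shown earlier (startPeriod 60, interval 15, retries 3) the budget is roughly 105 seconds; with startPeriod left at 0 it shrinks to about 45 seconds. The ALB-side budget is separate and is governed by health_check_grace_period_seconds.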

Health-check diagnosis checklist:

  • Is target_type = "ip"?
  • Does the app return 200 on the health-check path?
  • Does the health-check port number match the container port?
  • Does the task SG allow inbound traffic from the ALB's SG?
  • Is healthCheck.startPeriod set in the task definition?
  • Is health_check_grace_period_seconds set on the ECS service?

SpotInterruption: a Spot interruption is not a "failure"

Symptom

stopCode: "SpotInterruption"
stoppedReason: "Your Spot Task was interrupted."

The correct understanding

A Spot interruption is designed behavior: AWS is reclaiming capacity, and it is neither a software bug nor an operational mistake. The question to address is whether the system is designed not to break when an interruption occurs.

How a Spot interruption works:

  1. AWS decides to reclaim capacity and issues a warning 2 minutes before the task is stopped
  2. EventBridge delivers an ECS Task State Change event (stopCode=SpotInterruption)
  3. SIGTERM is sent to the task (the grace period is stopTimeout, max 120 seconds)
  4. SIGKILL is sent once the grace period expires

The 3-piece set to absorb it by design:

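① Graceful shutdown: on SIGTERM, finish processing the in-flight work
② Idempotent processing: even if killed midway, nothing is processed twice after restart
③ Capacity-provider strategy: protect the base with on-demand, and assign the additional tasks to Spot

# Detect Spot interruptions with EventBridge and route them to alerts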
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "ecs-spot-interruption"
  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Task State Change"]
    detail = {
      stopCode = ["SpotInterruption"]
    }
  })
}

For the graceful-shutdown implementation pattern, see the "end cleanly on SIGTERM" section of the ECS on Fargate production-operations guide. stopTimeout's default is 30 seconds and the max is 120 seconds. In production, set it explicitly to match the app's drain time.
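
As a rough illustration of the SIGTERM handling described above, the sketch below tracks in-flight work and runs a callback once everything has drained. It is not the implementation from the production-operations guide: InFlightTracker, installGracefulShutdown, and the SignalSource interface are hypothetical names, and the signal source is injected so the logic can be exercised without a real process (in Node, process satisfies the interface).

```typescript
// Minimal subset of Node's `process` that this sketch relies on.
interface SignalSource {
  on(signal: "SIGTERM", handler: () => void): void;
}

class InFlightTracker {
  private count = 0;
  private draining = false;
  private onDrained: (() => void) | null = null;

  // Call before starting a unit of work; false means "shutting down, reject it".
  begin(): boolean {
    if (this.draining) return false;
    this.count += 1;
    return true;
  }

  // Call when a unit of work finishes (success or failure).
  end(): void {
    this.count -= 1;
    this.finishIfIdle();
  }

  // Stop accepting work and invoke `onDrained` once nothing is in flight.
  drain(onDrained: () => void): void {
    if (this.draining) return;
    this.draining = true;
    this.onDrained = onDrained;
    this.finishIfIdle();
  }

  private finishIfIdle(): void {
    if (this.draining && this.count === 0 && this.onDrained !== null) {
      const done = this.onDrained;
      this.onDrained = null; // fire at most once
      done();
    }
  }
}

function installGracefulShutdown(
  source: SignalSource,
  tracker: InFlightTracker,
  onDrained: () => void,
): void {
  source.on("SIGTERM", () => tracker.drain(onDrained));
}
```

In a real service, onDrained would close the server and exit, and the whole drain has to finish within stopTimeout, because SIGKILL follows regardless of what the handler is doing.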


CannotStartContainerError / ContainerRuntimeTimeoutError

Symptom

CannotStartContainerError: ...
ContainerRuntimeTimeoutError: Timeout waiting for container to start

exitCode is null (hasn't reached container-process startup).

Cause and diagnosis

CannotStartContainerError:

  • A non-executable binary is specified in command / entryPoint
  • The volumes mount target can't be written to (especially with readonlyRootFilesystem: true)
  • The UID specified by user doesn't exist in the container
  • A linuxParameters setting that Fargate doesn't support

# Reproduce the same settings locally
docker run --rm \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp \
  my-image:tag

ContainerRuntimeTimeoutError:

The container's startup sequence (getting ENTRYPOINT/CMD running) timed out, for example because it waits too long for a connection to a dependency. If startup initialization blocks on an external service, consider extending the health check's startPeriod or decoupling external dependencies from the startup sequence.
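
One way to decouple dependencies from startup is to start listening immediately, connect in the background with capped backoff, and report readiness separately from liveness. The sketch below shows only the state tracking and the backoff schedule; ReadinessGate and backoffDelayMs are hypothetical names, the 200 ms base and 5 s cap are arbitrary assumptions, and which endpoint a health check should hit is a judgment call for your service.

```typescript
type DependencyState = "connecting" | "ready";

class ReadinessGate {
  private states: Record<string, DependencyState> = {};

  constructor(dependencies: string[]) {
    for (const name of dependencies) this.states[name] = "connecting";
  }

  markReady(name: string): void {
    this.states[name] = "ready";
  }

  markLost(name: string): void {
    this.states[name] = "connecting";
  }

  // Liveness: the process is up, whatever the dependencies are doing.
  livenessStatus(): number {
    return 200;
  }

  // Readiness: 200 only once every dependency is connected.
  readinessStatus(): number {
    const allReady = Object.keys(this.states).every((n) => this.states[n] === "ready");
    return allReady ? 200 : 503;
  }
}

// Exponential backoff with a cap, for the background (re)connect loop.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

The process reaches a running state without waiting on the database, and the retry loop calls markReady when the connection finally succeeds.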


ECS Exec: investigate inside a running container

ECS Exec lets you open a shell directly inside a container of a running task (including right after startup, while debugging). It needs no SSH, no open ports, and no key management.

Prerequisites

① enableExecuteCommand: true on the ECS service / task

resource "aws_ecs_service" "app" {
  enable_execute_command = true
  # ...
}

Note: enableExecuteCommand takes effect only for newly started tasks. It can't be applied retroactively to existing tasks. After changing the setting, replace the tasks with update-service --force-new-deployment.

② The 4 ssmmessages actions on the task role

{
  "Effect": "Allow",
  "Action": [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel"
  ],
  "Resource": "*"
}

This is a permission to put on the task role (taskRoleArn), not the execution role.

③ The SSM Session Manager plugin (client side)

# macOS
brew install session-manager-plugin

The actual commands

# Look up the task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster prod \
  --service-name web-api \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

# Open a shell in the container
aws ecs execute-command \
  --cluster prod \
  --task ${TASK_ID} \
  --container app \
  --interactive \
  --command "/bin/sh"

Once in the shell, you can directly check environment variables, file existence, network connectivity, and more.

# Check environment variables inside the container
env | grep DATABASE

# Check connectivity to the DB
nc -zv db.internal 5432

# Check the running processes
ps aux

Auditing ECS Exec

All ECS Exec operations are recorded in CloudTrail (as ExecuteCommand API calls). In addition, setting logging to OVERRIDE in the cluster's executeCommandConfiguration saves each session's input/output to CloudWatch Logs or S3. For access to production containers, I recommend always enabling this audit log.

For a deeper observability implementation, see Observability with OpenTelemetry × ECS.


Prevention: stop recurrence with observability

Once you've solved a problem, put in structural prevention so you don't spend time on the same problem again.

1. Structured logs + correlation ID

Output all logs as JSON and always include requestId / traceId. You can then filter in CloudWatch Logs Insights.

// Minimal structured-log implementation
const log = (level: string, msg: string, ctx: Record<string, unknown> = {}) => {
  process.stdout.write(
    JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...ctx }) + "\n"
  );
};

// Pass context (such as the request ID) on every log call
log("info", "server:start", { port: 8080, env: process.env.NODE_ENV });
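
To make sure requestId / traceId really appear on every line, one option is a child logger that carries the correlation fields for the lifetime of a request. This is a sketch building on the snippet above, not a replacement for a logging library: createLogger, Logger, and the injected sink are hypothetical names, and the sink parameter exists only so the output can be captured (in the service it would write to process.stdout).

```typescript
type LogContext = Record<string, unknown>;
type Sink = (line: string) => void;

interface Logger {
  log(level: string, msg: string, ctx?: LogContext): void;
  child(extra: LogContext): Logger;
}

// Creates a JSON logger; fields in `base` are attached to every line it emits.
function createLogger(sink: Sink, base: LogContext = {}): Logger {
  return {
    log(level: string, msg: string, ctx: LogContext = {}): void {
      sink(
        JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...base, ...ctx }),
      );
    },
    child(extra: LogContext): Logger {
      return createLogger(sink, { ...base, ...extra });
    },
  };
}
```

A request handler would call child({ requestId }) once at the top and use the returned logger for everything downstream, so a CloudWatch Logs Insights filter on requestId returns the whole request.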

2. Container Insights + alerts

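resource "aws_cloudwatch_metric_alarm" "task_stopped" {
  alarm_name          = "ecs-task-stopped-abnormally"
  # Fire when the minimum running-task count within the period is 0
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 60
  statistic           = "Minimum"
  threshold           = 0
  # Assumption: treat missing datapoints as breaching, in case no metric is emitted while nothing runs
  treat_missing_data  = "breaching"
  alarm_description   = "All tasks have stopped (even though desired > 0)"
  dimensions = {
    ClusterName = "prod"
    ServiceName = "web-api"
  }
}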

3. Deployment circuit breaker

When new tasks consecutively fail the health check, automatically roll back to the previous revision.

resource "aws_ecs_service" "app" {
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

This alone structurally prevents the accident of "a misdeploy stays stuck in production." Integration with the CI/CD pipeline is detailed in the ECS on Fargate CI/CD guide.

4. Always set the health check's startPeriod

{
  "healthCheck": {
    "startPeriod": 30,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}

If startPeriod is omitted, it defaults to 0 seconds, so a task can be dropped for being temporarily unhealthy while it is still starting up.

5. Codify the network design with Terraform

A lack of VPC endpoints is the most common cause of CannotPullContainer / ResourceInitializationError. Manage the patterns from the ECS on Fargate networking guide with Terraform and eliminate config differences between environments.


Quick reference: stop reason → where to look first → typical cause → fix

| Stop reason / stopCode | Where to look first | Typical cause | Fix |
| --- | --- | --- | --- |
| CannotPullContainerError | VPC endpoint list, execution role | Missing NAT/VPC endpoint in a private subnet, insufficient ECR permission on the execution role | Add ECR endpoints or check NAT, grant the policy |
| ResourceInitializationError | Execution role, VPC endpoints (secretsmanager/logs/ssm) | Secret unreachable, log write impossible, ENI-allocation failure | Add permissions to the execution role, add VPC endpoints, check subnet IP exhaustion |
| EssentialContainerExited (exitCode≠0) | CloudWatch Logs | App-startup failure, missing env/secrets, wrong CMD | Check the error in the logs, fix env vars and command |
| EssentialContainerExited (exitCode=137) | Container Insights memory graph | OOMKill (memory-limit overrun) | Move task memory up to the next valid pair, investigate a leak |
| EssentialContainerExited (exitCode=null) | containers[].reason in describe-tasks | CannotStartContainer (pre-startup failure) | Check the CMD/ENTRYPOINT path and permissions |
| ELB health-check failure | ALB target-group config, task SG | target_type=instance, SG misconfiguration, insufficient startPeriod | Recreate the target group with target_type=ip, add the SG allow rule, set startPeriod |
| SpotInterruption | The EventBridge event | Spot capacity reclamation (normal behavior) | Implement graceful shutdown, confirm the design is idempotent |
| ContainerRuntimeTimeoutError | The task definition's CMD/ENTRYPOINT | Startup processing too long, waiting on a dependency service | Optimize the startup sequence, extend startPeriod |
| CannotCreateVolumeError | The task definition's volumes, EFS config | EFS mount target unreachable, SG config | Check the EFS endpoint and SG |

Summary

ECS on Fargate troubleshooting is fastest when you narrow down logically in the order stoppedReason → stopCode → exitCode → CloudWatch Logs, rather than starting from a vague sense that something is off.

In my experience, the vast majority of "task won't start" problems come down to these three:

  1. Network reachability (a missing VPC endpoint or NAT in a private subnet)
  2. Confusing the execution role and the task role (a permission on the wrong role)
  3. Insufficient health-check grace (startPeriod and grace period not set)

I was able to achieve zero double charges on the payment platform because the design consistently eliminated problems in advance through structure and observability, rather than fixing them after a failure occurred. If you want to stabilize a Fargate production environment quickly and safely, feel free to reach out.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
