"After deploying, the task won't start," "it starts but dies immediately," "it never passes the ALB health check"—problems you'll surely hit at least once operating ECS on Fargate.
On a B2B SaaS for lumber distribution, I ran 221 endpoints behind API Gateway → NLB → ALB → ECS on Fargate, and ran the payment platform's worker group (zero double charges in production) on the same platform. Task-stop problems are solved not by "being careful in operations" but by a diagnostic pattern and structural prevention. Rather than staring at CloudWatch with a vague sense that "something's off," the shortest route is to narrow the cause logically from stoppedReason and stopCode.
This article systematizes ECS Fargate's task stop reasons by category, showing end-to-end "where to look → what's the cause → how to fix → how to prevent recurrence." Reading the ECS on Fargate production-operations guide first for the whole picture of production design deepens your understanding.
Where to look first: the starting point of diagnosis
From the moment you notice "a task died," there are 4 places to check.
| Place to check | What it tells you |
|---|---|
| The console's "Stopped tasks" tab | A summary of stoppedReason, exitCode, the final status |
| aws ecs describe-tasks | Structured data of all fields (the most detailed) |
| CloudWatch Logs (awslogs) | Logs the app wrote itself (panics, startup failures, etc.) |
| The EventBridge "ECS Task State Change" event | Async stop notification. The starting point for alert integration / automated response |
The first move is always this command.
aws ecs describe-tasks \
--cluster prod \
--tasks <task-id> \
--query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,stopCode:stopCode,containers:containers[*].{name:name,reason:reason,exitCode:exitCode,lastStatus:lastStatus}}' \
--output json
How to read describe-tasks: always confirm 5 fields
{
"lastStatus": "STOPPED",
"stoppedReason": "Essential container in task exited",
"stopCode": "EssentialContainerExited",
"containers": [
{
"name": "app",
"lastStatus": "STOPPED",
"exitCode": 1,
"reason": ""
}
]
}
| Field | Meaning | Point to note |
|---|---|---|
| lastStatus | The task's current overall state | Confirm that it is STOPPED |
| stoppedReason | Human-facing stop-reason text | You can read the category, such as "CannotPullContainerError" or "Your Spot Task was interrupted." |
| stopCode | A machine-readable stop code | Used for EventBridge filters and alerts |
| containers[].exitCode | The container process's exit code | 0=normal, 1=app error, 137=OOMKill, null=never reached startup |
| containers[].reason | The container's individual reason | Details for pull errors etc. appear here |
When exitCode is null, the failure is on the Fargate-agent side before the app even started (CannotPullContainer / ResourceInitializationError, etc.). Before chasing the app's logs, crush the agent-side cause.
Diagnostic flowchart
A task became STOPPED
|
+-- "CannotPullContainer" in stoppedReason ?
| YES → go to § CannotPullContainerError
|
+-- "ResourceInitializationError" in stoppedReason ?
| YES → go to § ResourceInitializationError
|
+-- stopCode = "EssentialContainerExited" ?
| YES → check containers[].exitCode
| exitCode=137 → go to § OutOfMemoryError
| exitCode≠0, non-null → go to § Essential container exited
| exitCode=null → go to § CannotStartContainerError
|
+-- "health check" / "ELB" in stoppedReason ?
| YES → go to § ELB health-check failure
|
+-- stopCode = "SpotInterruption" ?
| YES → go to § Spot interruption (not a failure)
|
+-- "timeout" in stoppedReason ?
YES → go to § ContainerRuntimeTimeoutError
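The flowchart above can be sketched as a pure function. This is a minimal sketch, not a definitive implementation: the input shape assumes the fields selected by the `--query` in the describe-tasks command earlier, and the names `classifyStoppedTask`, `StoppedTask`, and the category labels are hypothetical, chosen to match this article's section titles.

```typescript
// Field names follow `aws ecs describe-tasks`; category labels are this article's section names.
interface ContainerSummary {
  name: string;
  exitCode: number | null; // null when the container never started
  reason: string;
}

interface StoppedTask {
  stoppedReason: string;
  stopCode: string;
  containers: ContainerSummary[];
}

type Category =
  | "CannotPullContainerError"
  | "ResourceInitializationError"
  | "OutOfMemoryError"
  | "EssentialContainerExited"
  | "CannotStartContainerError"
  | "ELBHealthCheckFailure"
  | "SpotInterruption"
  | "ContainerRuntimeTimeoutError"
  | "Unclassified";

// Checks run in the same order as the flowchart; the first match wins.
function classifyStoppedTask(task: StoppedTask): Category {
  const reason = task.stoppedReason;
  if (reason.includes("CannotPullContainer")) return "CannotPullContainerError";
  if (reason.includes("ResourceInitializationError")) return "ResourceInitializationError";

  if (task.stopCode === "EssentialContainerExited") {
    const codes = task.containers.map((c) => c.exitCode);
    if (codes.includes(137)) return "OutOfMemoryError";
    if (codes.some((c) => c !== null && c !== 0)) return "EssentialContainerExited";
    if (codes.includes(null)) return "CannotStartContainerError";
  }

  if (/health check/i.test(reason) || reason.includes("ELB")) return "ELBHealthCheckFailure";
  if (task.stopCode === "SpotInterruption") return "SpotInterruption";
  if (/timeout/i.test(reason)) return "ContainerRuntimeTimeoutError";
  return "Unclassified";
}
```

Anything that falls through to "Unclassified" is a cue to read stoppedReason and containers[].reason by hand rather than trust the classifier.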
CannotPullContainerError: the image can't be pulled
Symptom
The task stops at PROVISIONING → PENDING, and stoppedReason shows the following.
CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...
exitCode is null (hasn't reached app startup).
Per-cause checklist
① Lack of VPC endpoints / NAT (most common)
A configuration in a private subnet with assignPublicIp=false, yet none of the following exists.
- A NAT Gateway (the outbound route)
- VPC endpoints for ECR (com.amazonaws.<region>.ecr.api + com.amazonaws.<region>.ecr.dkr)
- An S3 gateway endpoint (mandatory because ECR's layers are stored in S3)
# List the VPC endpoints
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=<vpc-id>" \
--query 'VpcEndpoints[*].{Service:ServiceName,State:State}'
The minimum set needed (private-subnet configuration):
| Endpoint | Type | Use |
|---|---|---|
| ecr.api | Interface | ECR API calls (for example, obtaining the authorization token) |
| ecr.dkr | Interface | Docker Registry API (image manifests and pull requests) |
| s3 | Gateway | The image layers themselves (ECR stores them in S3) |
| logs | Interface | The awslogs driver (CloudWatch Logs) |
| secretsmanager | Interface | secrets valueFrom (if used) |
| ssmmessages | Interface | ECS Exec (if used) |
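The table above can be turned into a small check over the output of the describe-vpc-endpoints command. The following is a sketch under assumptions: the input rows use the `Service` / `State` keys from the `--query` shown earlier, the state comparison is case-insensitive because the exact casing is not taken from the source, and the names `requiredEndpoints`, `missingEndpoints`, and `Feature` are hypothetical.

```typescript
// Optional features that add endpoints on top of the always-needed four.
type Feature = "secrets" | "ecsExec";

interface EndpointRow {
  Service: string; // e.g. "com.amazonaws.ap-northeast-1.ecr.api"
  State: string;
}

// Full service names required for a private-subnet task, per the table above.
function requiredEndpoints(region: string, features: Feature[]): string[] {
  const names = ["ecr.api", "ecr.dkr", "s3", "logs"];
  if (features.includes("secrets")) names.push("secretsmanager");
  if (features.includes("ecsExec")) names.push("ssmmessages");
  return names.map((n) => `com.amazonaws.${region}.${n}`);
}

// Required endpoints that are absent or not yet in the "available" state.
function missingEndpoints(region: string, present: EndpointRow[], features: Feature[]): string[] {
  const available = present
    .filter((e) => e.State.toLowerCase() === "available")
    .map((e) => e.Service);
  return requiredEndpoints(region, features).filter((s) => !available.includes(s));
}
```

An empty result only says the endpoints exist; the endpoint security group still has to allow 443 from the task's security group.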
② The execution role has no ECR permission
executionRoleArn doesn't have AmazonECSTaskExecutionRolePolicy attached, or a custom policy is insufficient.
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
}
Note: ecr:GetAuthorizationToken doesn't function unless the resource is *.
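When the execution role uses a custom policy, a quick way to spot a gap is to compare its Allow statements against the four actions above. This is a deliberately rough sketch: it ignores Deny statements, conditions, and resource scoping, it only understands the wildcards `ecr:*` and `*`, and the names `missingEcrActions` and `PolicyStatement` are hypothetical.

```typescript
interface PolicyStatement {
  Effect: string;
  Action: string | string[];
  Resource: string | string[];
}

// The four actions listed in the policy above.
const REQUIRED_ECR_ACTIONS = [
  "ecr:GetAuthorizationToken",
  "ecr:BatchCheckLayerAvailability",
  "ecr:GetDownloadUrlForLayer",
  "ecr:BatchGetImage",
];

// Returns the required actions that no Allow statement grants.
function missingEcrActions(statements: PolicyStatement[]): string[] {
  const allowed: string[] = [];
  for (const s of statements) {
    if (s.Effect !== "Allow") continue;
    const actions = Array.isArray(s.Action) ? s.Action : [s.Action];
    allowed.push(...actions);
  }
  const covered = (action: string): boolean =>
    allowed.includes(action) || allowed.includes("ecr:*") || allowed.includes("*");
  return REQUIRED_ECR_ACTIONS.filter((a) => !covered(a));
}
```

A non-empty result is a strong hint; an empty one is not proof, because the real IAM evaluation also considers the parts this sketch skips.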
③ Wrong tag/digest
The specified tag doesn't exist in ECR, or the digest has changed.
# List the tags in ECR
aws ecr describe-images \
--repository-name web-api \
--query 'imageDetails[*].{Tags:imageTags,Pushed:imagePushedAt}' \
--output table
④ Docker Hub rate limit
If you use a Docker Hub official image (like node:20) directly, you may hit the anonymous-pull rate limit. Switch to an operation that mirrors to ECR Public or ECR and pulls from there.
Fix flow
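1. Check whether this is a private-subnet configuration
YES → add the VPC endpoints (ecr.api / ecr.dkr / s3)
or check the NAT Gateway
2. Check the execution role's policy
→ Is AmazonECSTaskExecutionRolePolicy attached?
3. Check the image URI
→ Confirm with describe-images that the tag exists in ECR
4. Check the SG
→ Does the VPC endpoint's SG allow 443 from the task's SG?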
Prevention
Manage VPC endpoints as code with Terraform, and detect a missing route at the Plan stage. Do ECR pushes in the CI pipeline, and confirm the tag's existence right after push before triggering the deploy.
ResourceInitializationError: resource-initialization failure
Symptom
An error in the early phase of task startup. stoppedReason has a message like the below.
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: ...
Or:
ResourceInitializationError: failed to configure ENI: ...
Cause and diagnosis
ResourceInitializationError is close to a parent category of CannotPullContainerError. It covers the whole pre-startup initialization phase, in which the Fargate agent configures the network, retrieves secrets, and initializes logging.
Main causes:
| Cause | Checkpoint |
|---|---|
| Secret-retrieval failure (Secrets Manager / SSM unreachable) | Presence of the VPC endpoints secretsmanager / ssm |
| The execution role lacks the GetSecretValue / GetParameters permission | Check the execution role's policy |
| Log-init failure (CloudWatch Logs unreachable) | Presence of the VPC endpoint logs, logs:CreateLogStream on the role |
| ENI-allocation failure (subnet IP exhaustion) | Confirm the subnet's free IP count with describe-subnets |
| The SG blocks outbound | Check the task SG's outbound rules |
# Check the subnet's free IP count
aws ec2 describe-subnets \
--subnet-ids <subnet-id> \
--query 'Subnets[*].{CIDR:CidrBlock,Available:AvailableIpAddressCount}'
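To judge whether the Available count returned by this command is enough, it helps to estimate how many extra IPs a rolling deployment needs. The sketch below assumes one private IP per Fargate task, a service that is already running its desired count, the worst case where every new task lands in this one subnet, and a service `maximumPercent` that is rounded down when applied. The names `extraIpsDuringDeploy` and `hasIpHeadroom` are hypothetical.

```typescript
// Extra tasks (and therefore extra IPs) that may exist at the peak of a rolling deploy.
function extraIpsDuringDeploy(desiredCount: number, maximumPercent = 200): number {
  const peak = Math.floor((desiredCount * maximumPercent) / 100);
  return Math.max(0, peak - desiredCount);
}

// True when the subnet's free IPs cover that peak.
function hasIpHeadroom(availableIps: number, desiredCount: number, maximumPercent = 200): boolean {
  return availableIps >= extraIpsDuringDeploy(desiredCount, maximumPercent);
}
```

For example, a service with 10 tasks and maximumPercent 200 can need 10 additional IPs during a deploy, so a subnet reporting 8 free addresses is already at risk.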
Confusing the execution role and the task role generates chain errors
This is the most overlooked pattern. It's detailed in the ECS on Fargate production-operations guide too, but let me organize it again.
| Role | Config key | Used by | Typical permissions |
|---|---|---|---|
| Execution role | executionRoleArn | The ECS agent (at startup) | ECR pull, CloudWatch Logs write, Secrets Manager retrieval |
| Task role | taskRoleArn | App code (during execution) | App-specific AWS resources like S3, DynamoDB, SQS |
Secret injection (secrets[].valueFrom) is resolved by the agent at startup, so the permission belongs on the execution role. The task role needs it only when the app itself calls aws secretsmanager get-secret-value at runtime. Putting the permission on the wrong role shows up as ResourceInitializationError.
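As an illustration of the rule for injected secrets, the execution role's required actions can be derived from the task definition's `secrets` entries. This is a sketch with assumptions: a `valueFrom` that is a Secrets Manager ARN maps to `secretsmanager:GetSecretValue`, anything else is treated as a Parameter Store reference needing `ssm:GetParameters`, and KMS permissions for customer-managed keys are out of scope. The name `executionRoleActionsFor` is hypothetical.

```typescript
interface SecretRef {
  name: string;      // environment variable name inside the container
  valueFrom: string; // Secrets Manager ARN, or a Parameter Store ARN / name
}

// IAM actions the execution role (not the task role) needs for these injected secrets.
function executionRoleActionsFor(secrets: SecretRef[]): string[] {
  const actions: string[] = [];
  for (const s of secrets) {
    const isSecretsManager = /^arn:aws[\w-]*:secretsmanager:/.test(s.valueFrom);
    const action = isSecretsManager ? "secretsmanager:GetSecretValue" : "ssm:GetParameters";
    if (!actions.includes(action)) actions.push(action);
  }
  return actions.sort();
}
```

Permissions for calls the application makes itself at runtime are not covered here; those stay on the task role.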
OutOfMemoryError: memory-limit overrun
Symptom
The task's stoppedReason has:
OutOfMemoryError: Container killed due to memory usage
or containers[].exitCode = 137 (Linux's OOMKill is exit code 137).
Cause and diagnosis
On Fargate, memory is always set at the task level, and a hard limit (memory) can additionally be set per container. When a container's usage exceeds its hard limit, or the task as a whole exceeds the task-level memory, the Linux kernel's OOM Killer force-terminates the process.
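# Check memory usage with Container Insights (example query)
# Note: `date -v-1H` is BSD/macOS syntax; on GNU/Linux use `date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S`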
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=web-api \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average Maximum
Sorting the cause:
| Pattern | How to tell | Fix |
|---|---|---|
| Underestimated task size | Constant OOMKill. The max reaches the limit | Change the task size to the next pair up |
| Memory leak | Normal right after startup but gradually grows and dies | Investigate with a heap dump/profiler |
| Momentary overrun at spikes | Only in specific time bands or under high load | Set soft (memoryReservation) and hard (memory) limits separately |
The task definition's memory settings have two levels.
{
"name": "app",
"memory": 1024,
"memoryReservation": 512
}
- memory (hard limit): exceeding this is OOMKill
- memoryReservation (soft limit): a guideline for scheduling. Exceeding it isn't an immediate kill
To tolerate spikes, set memoryReservation slightly above normal usage and memory to about 1.5–2× that, leaving a buffer. Note that on Fargate the ceiling is the task-level memory: if the containers' combined memory exceeds it, the task can't start.
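The sizing rule of thumb and the task-level ceiling can be sketched as two helpers. The percentages are assumptions standing in for "slightly above" and "1.5–2×" (120% and 200% by default), the validation rules are a simplified reading of the paragraph above rather than the exact checks ECS performs, and the names `suggestLimits` and `memoryProblems` are hypothetical. All values are in MiB.

```typescript
interface ContainerMemory {
  name: string;
  memory?: number;            // hard limit
  memoryReservation?: number; // soft limit
}

// Derive soft and hard limits from observed normal usage (integer math avoids float drift).
function suggestLimits(normalUsageMiB: number, reservePercent = 120, hardPercent = 200) {
  const memoryReservation = Math.ceil((normalUsageMiB * reservePercent) / 100);
  const memory = Math.ceil((memoryReservation * hardPercent) / 100);
  return { memoryReservation, memory };
}

// List the ways a set of container limits conflicts with the task-level memory.
function memoryProblems(taskMemoryMiB: number, containers: ContainerMemory[]): string[] {
  const problems: string[] = [];
  let reserved = 0;
  for (const c of containers) {
    if (c.memory !== undefined && c.memoryReservation !== undefined && c.memoryReservation > c.memory) {
      problems.push(`${c.name}: memoryReservation exceeds memory`);
    }
    if (c.memory !== undefined && c.memory > taskMemoryMiB) {
      problems.push(`${c.name}: memory exceeds the task-level memory`);
    }
    reserved += c.memoryReservation ?? c.memory ?? 0;
  }
  if (reserved > taskMemoryMiB) problems.push("containers reserve more than the task-level memory");
  return problems;
}
```

With a normal usage of 400 MiB the defaults suggest a 480 MiB reservation and a 960 MiB hard limit, which fits a 1024 MiB task only if no sidecar reserves more than the remainder.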
Essential container exited / non-zero exitCode: app-startup failure
Symptom
Essential container in task exited
stopCode = EssentialContainerExited, containers[].exitCode is non-zero (other than 0).
Diagnosis procedure
Step 1: Check CloudWatch Logs (most important)
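# Fetch the latest events from the log stream
# (awslogs stream names have the form <awslogs-stream-prefix>/<container-name>/<task-id>)
aws logs get-log-events \
--log-group-name /ecs/web-api \
--log-stream-name <stream-prefix>/app/<task-id> \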
--limit 50 \
--query 'events[*].message'
Fargate writes the app's STDOUT/STDERR to CloudWatch Logs via the awslogs driver. If the app panicked or exited because "the config file can't be read," "DB connection failed," or "the port is already in use," the record is almost always here.
Step 2: Check for missing environment variables / secrets
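secrets[].valueFrom is resolved before the container starts, so a wrong ARN or a nonexistent version usually stops the task earlier, as a ResourceInitializationError. What ends up in this section is the case where a variable the app needs at startup is missing from the task definition, or is injected under a different name or with an unexpected value, and the app crashes.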
# Confirm the secret exists
aws secretsmanager describe-secret \
--secret-id arn:aws:secretsmanager:ap-northeast-1:111122223333:secret:prod/db-Ab12Cd
Step 3: Check CMD / ENTRYPOINT
A case where the task definition's command is wrong or references a nonexistent path.
# Start the image locally to reproduce the failure
docker run --rm \
-e DATABASE_URL=<test-url> \
111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/web-api:latest
A list of common exitCodes:
| exitCode | Meaning | Direction of the fix |
|---|---|---|
| 1 | A general app error | Check the error content in CloudWatch Logs |
| 2 | Misuse of Bash / a shell error | Check CMD/ENTRYPOINT syntax |
| 127 | Command not found | A path / binary-name error |
| 137 | SIGKILL (OOMKill or manual kill) | Memory settings or a StopTask operation |
| 143 | SIGTERM (graceful shutdown accepted normally) | Normal stop. Confirm it's an intentional stop |
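Codes above 128 in the table follow the common convention "128 + signal number", which is why 137 means SIGKILL (9) and 143 means SIGTERM (15). A small decoder makes that explicit. This is a sketch: the signal table lists only a few common Linux signals, and the name `describeExitCode` is hypothetical.

```typescript
// A few common Linux signal numbers; not exhaustive.
const SIGNAL_NAMES: Record<number, string> = {
  1: "SIGHUP",
  2: "SIGINT",
  6: "SIGABRT",
  9: "SIGKILL",
  11: "SIGSEGV",
  15: "SIGTERM",
};

// Turn a containers[].exitCode value into a short human-readable hint.
function describeExitCode(code: number | null): string {
  if (code === null) return "never started: check containers[].reason";
  if (code === 0) return "clean exit";
  if (code > 128) {
    const signal = code - 128;
    const name = SIGNAL_NAMES[signal];
    return name ? `killed by signal ${signal} (${name})` : `killed by signal ${signal}`;
  }
  if (code === 127) return "command not found: check the path or binary name";
  if (code === 2) return "shell misuse or shell error: check CMD/ENTRYPOINT syntax";
  return "application error: check CloudWatch Logs";
}
```

The decoder cannot tell an OOMKill from a manual StopTask, since both surface as SIGKILL; stoppedReason and the memory graph settle that.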
Task failed ELB health checks: failure to pass the health check
Symptom
The task is up but the service is judged unhealthy, and tasks repeatedly swap out. The ECS service's events show the following.
service web-api (port 8080) is unhealthy in target-group arn:... due to (reason Health checks failed)
Diagnosis flow
① Confirm target_type = "ip" (a Fargate-specific pitfall)
If the ALB's target group is left as target_type = "instance", Fargate tasks can't be registered correctly.
aws elbv2 describe-target-groups \
--query 'TargetGroups[*].{Name:TargetGroupName,TargetType:TargetType}'
Always confirm it's ip. If it's instance, recreate the target group (the target type can't be changed after creation).
② Confirm the health-check path and port
Confirm that the health-check path set on the ALB's target group (e.g. /healthz) is implemented in the app and responds on the correct port.
# Get the task's ENI IP and run the health check manually
TASK_ENI_IP=$(aws ecs describe-tasks \
--cluster prod --tasks <task-id> \
--query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
--output text)
# Check from a bastion inside the VPC (if directly reachable)
curl -v http://${TASK_ENI_IP}:8080/healthz
③ Is the SG blocking the health check
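Health checks are sent from the ALB itself, so the task SG's inbound rule must allow the container port from the security group that is actually attached to the ALB. An "allow only the ALB's SG" rule that references a different SG, or the wrong port, still blocks the check.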
# Check the task SG's inbound rules
aws ec2 describe-security-groups \
--group-ids <task-sg-id> \
--query 'SecurityGroups[*].IpPermissions'
④ The startPeriod and grace-period settings
An app that takes time to start (running DB migrations, loading a large model, etc.) gets judged "unhealthy" and killed during init unless you set the grace before the health check starts.
{
"healthCheck": {
"startPeriod": 60,
"interval": 15,
"timeout": 5,
"retries": 3
}
}
In addition, the ECS-service side also needs health_check_grace_period_seconds.
resource "aws_ecs_service" "app" {
# ...
health_check_grace_period_seconds = 60
}
healthCheck.startPeriod (the health-check grace of the container in the task definition) and health_check_grace_period_seconds (the grace during which the ECS service ignores ALB health-check failures) are different things. The grace after being registered with the ALB is set by the latter.
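Because the two settings guard different checks, both have to cover the app's startup time. The sketch below reports which one is too short. It is a simplification: it ignores the extra slack that intervals and retries provide after the grace ends, and the names `GraceConfig` and `tooShortFor` are hypothetical.

```typescript
interface GraceConfig {
  startPeriod: number;                   // task definition healthCheck.startPeriod, in seconds
  healthCheckGracePeriodSeconds: number; // ECS service setting, in seconds
}

// Names of the settings that are shorter than the measured startup time.
function tooShortFor(startupSeconds: number, cfg: GraceConfig): string[] {
  const short: string[] = [];
  if (cfg.startPeriod < startupSeconds) short.push("healthCheck.startPeriod");
  if (cfg.healthCheckGracePeriodSeconds < startupSeconds) short.push("health_check_grace_period_seconds");
  return short;
}
```

For an app that needs 90 seconds to start, a configuration of startPeriod 60 and grace period 120 would be flagged only for healthCheck.startPeriod.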
Health-check diagnosis checklist:
- Is target_type = "ip"
- Does the health-check path return 200 in the app
- Does the health-check port number match the container port
- Does the task SG allow inbound from the ALB's SG
- Is the task definition's healthCheck.startPeriod set
- Is the ECS service's health_check_grace_period_seconds set
SpotInterruption: a Spot interruption is not a "failure"
Symptom
stopCode: "SpotInterruption"
stoppedReason: "Your Spot Task was interrupted."
The correct understanding
A Spot interruption is a designed behavior for AWS to reclaim capacity, not a software bug or an operational mistake. What you should address is "is it designed not to break when an interruption occurs."
How a Spot interruption works:
- AWS decides to reclaim capacity (the warning arrives 2 minutes before the task is stopped)
- EventBridge fires an ECS Task State Change event (stopCode = SpotInterruption)
- SIGTERM is sent to the task (with the stopTimeout grace, max 120 seconds)
- SIGKILL after the grace period
The 3-piece set to absorb it by design:
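① Graceful shutdown: on SIGTERM, finish the in-flight work
② Idempotent processing: even if killed midway, nothing is processed twice after a restart
③ Capacity provider strategy: protect the base with on-demand and assign the additional capacity to Spot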
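Piece ② is the least obvious of the three, so here is a minimal sketch of the idea: record the result under an idempotency key and return the recorded result when the same key shows up again after a restart. Everything here is illustrative. `ResultStore`, `InMemoryStore`, and `processOnce` are hypothetical names, and the in-memory map stands in for a durable store; a production version would also need the check and the write to be atomic, which this sketch does not provide.

```typescript
// Stand-in for a durable store, such as a table with a unique key.
interface ResultStore {
  get(key: string): string | undefined;
  put(key: string, result: string): void;
}

class InMemoryStore implements ResultStore {
  private readonly results = new Map<string, string>();

  get(key: string): string | undefined {
    return this.results.get(key);
  }

  put(key: string, result: string): void {
    this.results.set(key, result);
  }
}

// Run the handler at most once per key; replays return the stored result.
function processOnce(store: ResultStore, key: string, handler: () => string): string {
  const prior = store.get(key);
  if (prior !== undefined) return prior; // already completed before the interruption
  const result = handler();
  store.put(key, result);
  return result;
}
```

A worker that is killed by a Spot interruption and restarted can call processOnce with the same key and store, and the handler will not run a second time for work that already completed.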
# Detect Spot interruptions with EventBridge and route them to alerts
resource "aws_cloudwatch_event_rule" "spot_interruption" {
name = "ecs-spot-interruption"
event_pattern = jsonencode({
source = ["aws.ecs"]
detail-type = ["ECS Task State Change"]
detail = {
stopCode = ["SpotInterruption"]
}
})
}
For the graceful-shutdown implementation pattern, see the "end cleanly on SIGTERM" section of the ECS on Fargate production-operations guide. stopTimeout's default is 30 seconds and the max is 120 seconds. In production, set it explicitly to match the app's drain time.
CannotStartContainerError / ContainerRuntimeTimeoutError
Symptom
CannotStartContainerError: ...
ContainerRuntimeTimeoutError: Timeout waiting for container to start
exitCode is null (hasn't reached container-process startup).
Cause and diagnosis
CannotStartContainerError:
- A non-executable binary specified in command / entryPoint
- Can't write to the volumes mount target (especially with readonlyRootFilesystem: true)
- A UID specified by user doesn't exist in the container
- A linuxParameters dependency setting unsupported on Fargate
# Reproduce the same settings locally
docker run --rm \
--user 10001:10001 \
--read-only \
--tmpfs /tmp \
my-image:tag
ContainerRuntimeTimeoutError:
The container's startup sequence (up to the point where ENTRYPOINT/CMD begins running) timed out, for example because it waits too long for a connection to a dependency service. If startup initialization waits on an external service, consider extending the health check's startPeriod or decoupling external dependencies from the startup sequence.
ECS Exec: investigate inside a running container
When a task is running (or while debugging right after startup), ECS Exec lets you enter the container directly and investigate. No SSH, no open ports, and no key management are needed.
Prerequisites
① enableExecuteCommand: true on the ECS service / task
resource "aws_ecs_service" "app" {
enable_execute_command = true
# ...
}
Note: enableExecuteCommand is effective only for newly-started tasks. It can't be retrofitted to existing tasks. After changing the setting, swap tasks with force-new-deployment.
② The 4 ssmmessages actions on the task role
{
"Effect": "Allow",
"Action": [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel"
],
"Resource": "*"
}
This is a permission to put on the task role (taskRoleArn), not the execution role.
③ The SSM Session Manager plugin (client side)
# macOS
brew install session-manager-plugin
The actual commands
# Look up the task ID
TASK_ID=$(aws ecs list-tasks \
--cluster prod \
--service-name web-api \
--query 'taskArns[0]' \
--output text | awk -F/ '{print $NF}')
# Enter the container
aws ecs execute-command \
--cluster prod \
--task ${TASK_ID} \
--container app \
--interactive \
--command "/bin/sh"
Once in the shell, you can directly check environment variables, file existence, network connectivity, and more.
# Check environment variables inside the container
env | grep DATABASE
# Check connectivity to the DB
nc -zv db.internal 5432
# Check the running processes
ps aux
Auditing ECS Exec
All ECS Exec operations are recorded in CloudTrail (as ExecuteCommand API calls). Furthermore, configuring logging: OVERRIDE lets you save the session's input/output to CloudWatch Logs or S3. For access to production containers, I recommend always enabling this audit log.
For a deeper observability implementation, see Observability with OpenTelemetry × ECS.
Prevention: stop recurrence with observability
Once you've solved a problem, put in structural prevention so you don't spend time on the same problem again.
1. Structured logs + correlation ID
Output all logs as JSON and always include requestId / traceId. You can then filter in CloudWatch Logs Insights.
// Minimal structured-logging implementation
const log = (level: string, msg: string, ctx: Record<string, unknown> = {}) => {
process.stdout.write(
JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...ctx }) + "\n"
);
};
// Pass context (such as the request ID) with every log call
log("info", "server:start", { port: 8080, env: process.env.NODE_ENV });
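One way to make sure requestId / traceId really appears on every line is to bind it once per request instead of passing it at each call site. The following is a sketch of that idea, not the implementation used in the system described here: `createLogger`, `Logger`, and the injected `sink` are hypothetical, and the sink is a parameter so the example does not depend on process.stdout.

```typescript
type Fields = Record<string, unknown>;

interface Logger {
  info(msg: string, ctx?: Fields): void;
  error(msg: string, ctx?: Fields): void;
  child(extra: Fields): Logger; // a logger with extra fields bound to every line
}

function createLogger(sink: (line: string) => void, bound: Fields = {}): Logger {
  const emit = (level: string, msg: string, ctx: Fields = {}): void => {
    // Per-call context overrides bound fields when keys collide.
    sink(JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...bound, ...ctx }));
  };
  return {
    info: (msg, ctx) => emit("info", msg, ctx),
    error: (msg, ctx) => emit("error", msg, ctx),
    child: (extra) => createLogger(sink, { ...bound, ...extra }),
  };
}
```

In a request handler, creating `root.child({ requestId })` once and using that logger for the rest of the request keeps the correlation ID on every line without repeating it.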
2. Container Insights + alerts
resource "aws_cloudwatch_metric_alarm" "task_stopped" {
alarm_name = "ecs-task-stopped-abnormally"
comparison_operator = "LessThanOrEqualToThreshold"
evaluation_periods = 1
metric_name = "RunningTaskCount"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Minimum"
threshold = 0
alarm_description = "All tasks have stopped (even though desired > 0)"
dimensions = {
ClusterName = "prod"
ServiceName = "web-api"
}
}
3. Deployment circuit breaker
When new tasks consecutively fail the health check, automatically roll back to the previous revision.
resource "aws_ecs_service" "app" {
deployment_circuit_breaker {
enable = true
rollback = true
}
}
This alone structurally prevents the accident of "a misdeploy stays stuck in production." Integration with the CI/CD pipeline is detailed in the ECS on Fargate CI/CD guide.
4. Always set the health check's startPeriod
{
"healthCheck": {
"startPeriod": 30,
"interval": 15,
"timeout": 5,
"retries": 3
}
}
Without startPeriod the default is 0 seconds, so a task can be judged unhealthy and dropped during a temporary unhealthy state at startup.
5. Codify the network design with Terraform
A lack of VPC endpoints is the most common cause of CannotPullContainer / ResourceInitializationError. Manage the patterns from the ECS on Fargate networking guide with Terraform and eliminate config differences between environments.
Quick reference: stop reason → where to look first → typical cause → fix
| Stop reason / stopCode | Where to look first | Typical cause | Fix |
|---|---|---|---|
| CannotPullContainerError | VPC endpoint list, execution role | Missing NAT/VPC endpoint in a private subnet, insufficient ECR permission on the execution role | Add ECR endpoints or check NAT, grant the policy |
| ResourceInitializationError | Execution role, VPC endpoints (secretsmanager/logs/ssm) | Secret unreachable, log-write impossible, ENI-allocation failure | Add permissions to the execution role, add VPC endpoints, check subnet IP exhaustion |
| EssentialContainerExited (exitCode≠0) | CloudWatch Logs | App-startup failure, missing env/secrets, wrong CMD | Check the error content in logs, fix env vars and command |
| EssentialContainerExited (exitCode=137) | Container Insights memory graph | OOMKill (memory-limit overrun) | Change task memory to the next pair up, investigate a leak |
| EssentialContainerExited (exitCode=null) | describe-tasks' containers[].reason | CannotStartContainer (pre-startup failure) | Check the CMD/ENTRYPOINT path and permissions |
| ELB health-check failure | ALB target-group config, task SG | target_type=instance, SG misconfig, insufficient startPeriod | Change to target_type=ip, add SG allow, set startPeriod |
| SpotInterruption | The EventBridge event | Spot capacity reclamation (normal behavior) | Implement graceful shutdown, confirm idempotent design |
| ContainerRuntimeTimeoutError | The task definition's CMD/ENTRYPOINT | Startup processing too long, wait for a dependency service | Optimize the startup sequence, extend startPeriod |
| CannotCreateVolumeError | The task definition's volumes, EFS config | EFS mount target unreachable, SG config | Check the EFS endpoint and SG |
Summary
ECS on Fargate troubleshooting is fastest with the pattern of logically narrowing down in the order stoppedReason → stopCode → exitCode → CloudWatch Logs, rather than starting from a feeling of "something's off."
From the experience I've accumulated in practice, the vast majority of "task won't start" problems consolidate into these 3.
- Network reachability (a missing VPC endpoint or NAT in a private subnet)
- Confusing the execution role and the task role (a permission on the wrong role)
- Insufficient health-check grace (startPeriod and grace period not set)
The reason I could achieve zero double charges on the payment foundation is that I consistently had a design that crushes problems in advance with structure and observability, not "fix it after a failure occurs." If you want to stabilize a Fargate production environment fast and safely, feel free to reach out.