ECS on Fargate Troubleshooting Complete Guide: Diagnosing and Fixing Why Tasks Won't Start or Die Immediately, by Stop-Reason Code

"After deploying, the task won't start," "it starts but dies immediately," "it never passes the ALB health check": if you operate ECS on Fargate, you will hit each of these at least once.

On a B2B SaaS for lumber distribution, I served 221 endpoints through API Gateway → NLB → ALB → ECS on Fargate, and ran the worker fleet for the payment platform (zero double charges in production) on the same infrastructure. What eliminates task-stop problems is not "being careful in operations" but a diagnostic pattern plus structural prevention. Rather than staring at CloudWatch with a feeling that "something's off," the shortest route is to narrow the cause logically from stoppedReason and stopCode.

This article organizes ECS Fargate task stop reasons by category and walks each one end to end: where to look, what the cause is, how to fix it, and how to prevent recurrence. For the overall picture of production design, read the ECS on Fargate production-operations guide first.

Where to look first: the starting point of diagnosis

From the moment you notice "a task died," there are 4 places to check.

Place to check	What it tells you
The console's "Stopped tasks" tab	A summary of stoppedReason, exitCode, the final status
`aws ecs describe-tasks`	Structured data of all fields (the most detailed)
CloudWatch Logs (awslogs)	Logs the app wrote itself (panics, startup failures, etc.)
The EventBridge "ECS Task State Change" event	Async stop notification. The starting point for alert integration / automated response

The first move is always this command.

aws ecs describe-tasks \
  --cluster prod \
  --tasks <task-id> \
  --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason,stopCode:stopCode,containers:containers[*].{name:name,reason:reason,exitCode:exitCode,lastStatus:lastStatus}}' \
  --output json

How to read describe-tasks: always confirm 5 fields

{
  "lastStatus": "STOPPED",
  "stoppedReason": "Essential container in task exited",
  "stopCode": "EssentialContainerExited",
  "containers": [
    {
      "name": "app",
      "lastStatus": "STOPPED",
      "exitCode": 1,
      "reason": ""
    }
  ]
}

Field	Meaning	Point to note
`lastStatus`	The task's current overall state	`STOPPED` is confirmed
`stoppedReason`	Human-facing stop-reason text	You can read the category like "CannotPullContainerError" or "Your Spot Task was interrupted."
`stopCode`	A machine-judgeable stop code	Used for EventBridge filters and alerts
`containers[].exitCode`	The container process's exit code	0=normal, 1=app error, 137=OOMKill, null=hasn't even reached startup
`containers[].reason`	The container's individual reason	Details for pull errors etc. appear here

When exitCode is null, the failure is on the Fargate-agent side before the app even started (CannotPullContainer / ResourceInitializationError, etc.). Before chasing the app's logs, crush the agent-side cause.

Diagnostic flowchart

A task became STOPPED
        |
        +-- "CannotPullContainer" in stoppedReason ?
        |       YES → go to § CannotPullContainerError
        |
        +-- "ResourceInitializationError" in stoppedReason ?
        |       YES → go to § ResourceInitializationError
        |
        +-- stopCode = "EssentialContainerExited" ?
        |       YES → check containers[].exitCode
        |               exitCode=137 → go to § OutOfMemoryError
        |               exitCode≠0, non-null → go to § Essential container exited
        |               exitCode=null → go to § CannotStartContainerError
        |
        +-- "health check" / "ELB" in stoppedReason ?
        |       YES → go to § ELB health-check failure
        |
        +-- stopCode = "SpotInterruption" ?
        |       YES → go to § Spot interruption (not a failure)
        |
        +-- "timeout" in stoppedReason ?
                YES → go to § ContainerRuntimeTimeoutError
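
The flowchart above can be sketched as a small function. This is a minimal illustration, not an official classifier: the `StoppedTask` shape, the `classifyStop` name, and the `"unclassified"` fallback are assumptions introduced here, and the branch order simply mirrors the chart.

```typescript
// Hypothetical input: the subset of describe-tasks output that the flowchart uses.
interface StoppedTask {
  stoppedReason: string;
  stopCode: string;
  exitCodes: Array<number | null>; // containers[].exitCode; null = never started
}

function has(text: string, needle: string): boolean {
  return text.indexOf(needle) !== -1;
}

// Returns the name of the section to read next, following the chart top to bottom.
function classifyStop(task: StoppedTask): string {
  const reason = task.stoppedReason;
  if (has(reason, "CannotPullContainer")) return "CannotPullContainerError";
  if (has(reason, "ResourceInitializationError")) return "ResourceInitializationError";
  if (task.stopCode === "EssentialContainerExited") {
    if (task.exitCodes.indexOf(137) !== -1) return "OutOfMemoryError";
    if (task.exitCodes.some((c) => c !== null && c !== 0)) return "Essential container exited";
    if (task.exitCodes.some((c) => c === null)) return "CannotStartContainerError";
    return "unclassified"; // e.g. every container exited 0; the chart has no branch for it
  }
  const lower = reason.toLowerCase();
  if (has(lower, "health check") || has(reason, "ELB")) return "ELB health-check failure";
  if (task.stopCode === "SpotInterruption") return "Spot interruption";
  if (has(lower, "timeout")) return "ContainerRuntimeTimeoutError";
  return "unclassified";
}
```

Feeding it the sample describe-tasks output from earlier (stopCode `EssentialContainerExited`, exitCode 1) yields "Essential container exited", which points at the CloudWatch Logs step.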

CannotPullContainerError: the image can't be pulled

Symptom

The task stops at PROVISIONING → PENDING, and stoppedReason shows the following.

CannotPullContainerError: pull image manifest has been retried 1 time(s): failed to resolve ref ...

exitCode is null (hasn't reached app startup).

Per-cause checklist

① Lack of VPC endpoints / NAT (most common)

The task runs in a private subnet with assignPublicIp=false, but neither of the following outbound routes exists.

A NAT Gateway (the outbound route)
VPC endpoints for ECR (com.amazonaws.<region>.ecr.api + com.amazonaws.<region>.ecr.dkr) together with an S3 gateway endpoint (mandatory because ECR's layers are stored in S3)

# List the VPC endpoints
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=<vpc-id>" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'

The minimum set needed (private-subnet configuration):

Endpoint	Type	Use
`ecr.api`	Interface	ECR API calls (authorization token, image metadata)
`ecr.dkr`	Interface	Docker Registry API (image manifest and layer pulls)
`s3`	Gateway	The substance of ECR layers (stored in S3)
`logs`	Interface	The awslogs driver (CloudWatch Logs)
`secretsmanager`	Interface	secrets valueFrom (if used)
`ssmmessages`	Interface	ECS Exec (if used)
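
As a rough way to compare the describe-vpc-endpoints output against the table above, the check can be sketched like this. The `missingEndpoints` function and the `EndpointNeeds` flags are hypothetical names for this sketch; it assumes you pass in only the service names of endpoints that are in the available state.

```typescript
// Hypothetical feature flags: which optional endpoints this workload depends on.
interface EndpointNeeds {
  usesSecretsManager: boolean;
  usesEcsExec: boolean;
}

// Returns the full service names from the minimum set that are not present.
function missingEndpoints(region: string, present: string[], needs: EndpointNeeds): string[] {
  const required = ["ecr.api", "ecr.dkr", "s3", "logs"];
  if (needs.usesSecretsManager) required.push("secretsmanager");
  if (needs.usesEcsExec) required.push("ssmmessages");
  return required
    .map((suffix) => "com.amazonaws." + region + "." + suffix)
    .filter((serviceName) => present.indexOf(serviceName) === -1);
}
```

An empty result means the minimum set is in place; anything returned is a candidate cause for a pull or initialization failure in a NAT-less subnet.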

② The execution role has no ECR permission

executionRoleArn doesn't have AmazonECSTaskExecutionRolePolicy attached, or a custom policy is insufficient.

{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}

Note: ecr:GetAuthorizationToken does not support resource-level permissions, so it only works when Resource is *.

③ Wrong tag/digest

The specified tag doesn't exist in ECR, or the digest has changed.

# List the tags in ECR
aws ecr describe-images \
  --repository-name web-api \
  --query 'imageDetails[*].{Tags:imageTags,Pushed:imagePushedAt}' \
  --output table

④ Docker Hub rate limit

If you pull a Docker Hub official image (like node:20) directly, you may hit the anonymous-pull rate limit. Mirror the image to ECR (or use ECR Public) and pull from there instead.

Fix flow

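1. Check whether this is a private-subnet configuration
   YES → add the VPC endpoints (ecr.api / ecr.dkr / s3)
         or check the NAT Gateway

2. Check the execution role's policy
   → is AmazonECSTaskExecutionRolePolicy attached?

3. Check the image URI
   → confirm with describe-images that the tag exists in ECR

4. Check the SGs
   → does the VPC endpoint's SG allow 443 from the task's SG?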

Prevention

Manage VPC endpoints as code with Terraform, and detect a missing route at the Plan stage. Do ECR pushes in the CI pipeline, and confirm the tag's existence right after push before triggering the deploy.
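
The "confirm the tag exists before deploying" gate can be sketched as a small check over the JSON that `aws ecr describe-images` returns. The `tagExists` helper is a hypothetical name; the `ImageDetail` type covers only the fields this sketch reads.

```typescript
// One element of `imageDetails` in the describe-images JSON output.
// imageTags is absent for untagged images.
interface ImageDetail {
  imageTags?: string[];
  imageDigest?: string;
}

// True when at least one image in the repository carries the given tag.
function tagExists(imageDetails: ImageDetail[], tag: string): boolean {
  return imageDetails.some((d) => (d.imageTags || []).indexOf(tag) !== -1);
}
```

In a CI pipeline, a false result right after the push step would stop the job before the deploy is triggered.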

ResourceInitializationError: resource-initialization failure

Symptom

The error occurs early in task startup. stoppedReason contains a message like the following.

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: ...

Or:

ResourceInitializationError: failed to configure ENI: ...

Cause and diagnosis

ResourceInitializationError is close to a parent category of CannotPullContainerError. It can occur anywhere in the pre-startup initialization phase, where the Fargate agent configures networking, retrieves secrets, and initializes logging.

Main causes:

Cause	Checkpoint
Secret-retrieval failure (Secrets Manager / SSM unreachable)	Presence of the VPC endpoints `secretsmanager` / `ssm`
The execution role lacks `secretsmanager:GetSecretValue` / `ssm:GetParameters`	Check the execution role's policy
Log-init failure (CloudWatch Logs unreachable)	Presence of the VPC endpoint `logs`, `logs:CreateLogStream` on the role
ENI-allocation failure (subnet IP exhaustion)	Confirm the subnet's free IP count with `describe-subnets`
The SG blocks outbound	Check the task SG's outbound rules

# Check the subnet's free IP count
aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query 'Subnets[*].{CIDR:CidrBlock,Available:AvailableIpAddressCount}'

Confusing the execution role and the task role causes cascading errors

This is the most overlooked pattern. The ECS on Fargate production-operations guide covers it in detail, but here is a recap.

Role	Config key	Used by	Typical permissions
Execution role	`executionRoleArn`	The ECS agent (at startup)	ECR pull, CloudWatch Logs write, Secrets Manager retrieval
Task role	`taskRoleArn`	App code (during execution)	App-specific AWS resources like S3, DynamoDB, SQS

Secret injection (secrets[].valueFrom) is resolved by the agent at startup, so the permission belongs on the execution role. The permission belongs on the task role only when the app itself calls Secrets Manager (GetSecretValue) at run time. Breaking this rule surfaces as ResourceInitializationError.
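
To make the rule concrete, here is a rough sketch that maps each secrets[].valueFrom entry to the IAM action the execution role would need. The function names are hypothetical. It assumes that anything that is not a Secrets Manager ARN is an SSM Parameter Store reference (an ARN or a bare parameter name), and the action names reflect my understanding of the ECS documentation, so verify them against your own policy.

```typescript
type SecretAction = "secretsmanager:GetSecretValue" | "ssm:GetParameters";

// ARN layout: arn:partition:service:region:account:resource
function requiredExecutionRoleAction(valueFrom: string): SecretAction {
  const parts = valueFrom.split(":");
  if (parts[0] === "arn" && parts[2] === "secretsmanager") {
    return "secretsmanager:GetSecretValue";
  }
  // Assumption: an SSM ARN or a bare parameter name.
  return "ssm:GetParameters";
}

// Distinct actions the execution role needs for a whole secrets[] list.
function executionRoleActions(valueFroms: string[]): SecretAction[] {
  const seen: SecretAction[] = [];
  for (const v of valueFroms) {
    const action = requiredExecutionRoleAction(v);
    if (seen.indexOf(action) === -1) seen.push(action);
  }
  return seen;
}
```

Comparing this list with the policy attached to executionRoleArn (not taskRoleArn) is a quick way to catch a permission placed on the wrong role.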

OutOfMemoryError: memory-limit overrun

Symptom

The task's stoppedReason has:

OutOfMemoryError: Container killed due to memory usage

or containers[].exitCode = 137 (128 + 9: the process was killed with SIGKILL, which is what the Linux OOM killer sends).

Cause and diagnosis

On Fargate, the task-level memory is mandatory and caps the whole task, while a container-level memory (hard limit) is optional. When a container's usage exceeds its hard limit, or the task as a whole exceeds the task-level memory, the Linux kernel's OOM killer force-terminates the process.

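# Check memory usage via Container Insights (example query)
# Note: `date -v-1H` is BSD/macOS syntax; on GNU/Linux use `date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S`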
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=web-api \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average Maximum

Sorting the cause:

Pattern	How to tell	Fix
Underestimated task size	Constant OOMKill. The max reaches the limit	Move the task to the next larger CPU/memory combination
Memory leak	Normal right after startup but gradually grows and dies	Investigate with a heap dump/profiler
Momentary overrun at spikes	Only in specific time bands or under high load	Set soft (memoryReservation) and hard (memory) limits separately

The task definition's memory settings have two levels.

{
  "name": "app",
  "memory": 1024,
  "memoryReservation": 512
}

memory (hard limit): exceeding this is OOMKill
memoryReservation (soft limit): a guideline for scheduling. Exceeding it isn't an immediate kill

For spike tolerance, set memoryReservation a little above normal usage and memory to about 1.5–2× that, leaving headroom. Note, however, that on Fargate the task-level memory is the ceiling for the whole task: if the containers' combined settings exceed it, the task cannot start.
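
The arithmetic behind that rule of thumb can be sketched as follows. The helper names and the default percentages (10% headroom over normal usage, a hard limit at 1.5× the reservation) are illustrative assumptions, not AWS recommendations. Percentages are kept as integers to avoid floating-point surprises.

```typescript
interface ContainerMemory {
  name: string;
  memory: number;            // hard limit, MiB
  memoryReservation: number; // soft limit, MiB
}

// Derive soft/hard limits from the observed steady-state usage in MiB.
function planContainerMemory(
  name: string,
  normalMiB: number,
  headroomPercent = 110,
  spikePercent = 150
): ContainerMemory {
  const memoryReservation = Math.ceil((normalMiB * headroomPercent) / 100);
  const memory = Math.ceil((memoryReservation * spikePercent) / 100);
  return { name, memory, memoryReservation };
}

// Checks the article's constraint: the containers' hard limits must fit in the task memory.
function fitsTask(taskMemoryMiB: number, containers: ContainerMemory[]): boolean {
  const totalHard = containers.reduce((sum, c) => sum + c.memory, 0);
  return totalHard <= taskMemoryMiB;
}
```

For a container that normally uses 400 MiB, this gives memoryReservation 440 and memory 660; together with a 256 MiB sidecar it fits a 1024 MiB task but not a 512 MiB one.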

Essential container exited / non-zero exitCode: app-startup failure

Symptom

Essential container in task exited

stopCode = EssentialContainerExited, and containers[].exitCode is non-zero.

Diagnosis procedure

Step 1: Check CloudWatch Logs (most important)

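# Fetch the most recent events from the log stream
# The awslogs stream name is <awslogs-stream-prefix>/<container-name>/<task-id>
aws logs get-log-events \
  --log-group-name /ecs/web-api \
  --log-stream-name <stream-prefix>/app/<task-id> \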
  --limit 50 \
  --query 'events[*].message'

Fargate sends the app's STDOUT/STDERR to CloudWatch Logs via the awslogs driver. If the app panicked or exited because "the config file can't be read," "DB connection failed," "the port is already in use," and so on, the record is here.

Step 2: Check for missing environment variables / secrets

If secrets[].valueFrom points at the wrong secret, JSON key, or version, the app can start with a missing or unexpected value and crash. (A secret that cannot be retrieved at all normally fails earlier, as ResourceInitializationError.)

# Confirm the secret exists
aws secretsmanager describe-secret \
  --secret-id arn:aws:secretsmanager:ap-northeast-1:111122223333:secret:prod/db-Ab12Cd

Step 3: Check CMD / ENTRYPOINT

The task definition's command may be wrong or may reference a nonexistent path.

# Run the image locally to reproduce the failure
docker run --rm \
  -e DATABASE_URL=<test-url> \
  111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/web-api:latest

A list of common exitCodes:

exitCode	Meaning	Direction of the fix
1	A general app error	Check the error content in CloudWatch Logs
2	Misuse of Bash / a shell error	Check CMD/ENTRYPOINT syntax
127	Command not found	A path / binary-name error
137	SIGKILL (OOMKill or manual kill)	Memory settings or a StopTask operation
143	SIGTERM (graceful shutdown accepted normally)	Normal stop. Confirm it's an intentional stop
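
The table follows the usual shell convention that an exit code above 128 means "killed by signal (code − 128)". A minimal sketch of that decoding is below; `describeExitCode` is a hypothetical helper and only names the two signals that appear in the table.

```typescript
// Translate containers[].exitCode into a short human-readable diagnosis.
function describeExitCode(exitCode: number | null): string {
  if (exitCode === null) return "container never started (check containers[].reason)";
  if (exitCode === 0) return "clean exit";
  if (exitCode > 128) {
    const signal = exitCode - 128;
    const names: Record<number, string> = { 9: "SIGKILL", 15: "SIGTERM" };
    const name = names[signal];
    return "killed by signal " + signal + (name ? " (" + name + ")" : "");
  }
  if (exitCode === 127) return "command not found";
  if (exitCode === 2) return "shell misuse / shell error";
  return "application error";
}
```

So 137 decodes to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM), which is why the two are handled so differently in the table.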

Task failed ELB health checks: failure to pass the health check

Symptom

The task is up but the service is judged unhealthy, and tasks repeatedly swap out. The ECS service's events show the following.

service web-api (port 8080) is unhealthy in target-group arn:... due to (reason Health checks failed)

Diagnosis flow

① Confirm target_type = "ip" (a Fargate-specific pitfall)

If the ALB's target group is left as target_type = "instance", Fargate tasks can't be registered correctly.

aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].{Name:TargetGroupName,TargetType:TargetType}'

Always confirm it is ip. If it is instance, recreate the target group: the target type cannot be changed after creation.

② Confirm the health-check path and port

Confirm that the health-check path set on the ALB's target group (e.g. /healthz) is implemented in the app and answers on the correct port.

# Get the task's ENI IP and run the health check by hand
TASK_ENI_IP=$(aws ecs describe-tasks \
  --cluster prod --tasks <task-id> \
  --query 'tasks[0].attachments[0].details[?name==`privateIPv4Address`].value' \
  --output text)

# Check from a bastion inside the VPC (if it has direct connectivity)
curl -v http://${TASK_ENI_IP}:8080/healthz

③ Is the SG blocking the health check

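Health checks are sent from the ALB itself, so the task SG must allow inbound traffic on the health-check port from the security group that is actually attached to the ALB. Even an "allow only the ALB's SG" rule fails if it references a different SG or a different port.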

# Check the task SG's inbound rules
aws ec2 describe-security-groups \
  --group-ids <task-sg-id> \
  --query 'SecurityGroups[*].IpPermissions'

④ The startPeriod and grace-period settings

An app that takes time to start (running DB migrations, loading a large model, etc.) is judged "unhealthy" and killed during initialization unless you configure a grace period before health-check failures count.

{
  "healthCheck": {
    "startPeriod": 60,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}

In addition, the ECS-service side also needs health_check_grace_period_seconds.

resource "aws_ecs_service" "app" {
  # ...
  health_check_grace_period_seconds = 60
}

healthCheck.startPeriod (the health-check grace of the container in the task definition) and health_check_grace_period_seconds (the grace during which the ECS service ignores ALB health-check failures) are different things. The grace after being registered with the ALB is set by the latter.

Health-check diagnosis checklist:

Is target_type = "ip"?
Does the health-check path return 200 in the app?
Does the health-check port number match the container port?
Does the task SG allow inbound from the ALB's SG?
Is the task definition's healthCheck.startPeriod set?
Is the ECS service's health_check_grace_period_seconds set?
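
The checklist can be turned into a mechanical review. The sketch below is only an illustration: every field in `HealthCheckSetup` is a hypothetical name that you would fill in by hand from the target group, the security groups, and the task definition.

```typescript
// Hypothetical snapshot of the settings the checklist asks about.
interface HealthCheckSetup {
  targetType: string;
  pathReturns200: boolean;
  healthCheckPort: number;
  containerPort: number;
  taskSgAllowsAlbSg: boolean;
  startPeriodSeconds: number;
  gracePeriodSeconds: number;
}

// Returns the checklist items that fail, in checklist order.
function failedChecks(s: HealthCheckSetup): string[] {
  const checks: Array<[string, boolean]> = [
    ["target_type is ip", s.targetType === "ip"],
    ["health-check path returns 200", s.pathReturns200],
    ["health-check port matches container port", s.healthCheckPort === s.containerPort],
    ["task SG allows inbound from the ALB's SG", s.taskSgAllowsAlbSg],
    ["healthCheck.startPeriod is set", s.startPeriodSeconds > 0],
    ["health_check_grace_period_seconds is set", s.gracePeriodSeconds > 0],
  ];
  return checks.filter((c) => !c[1]).map((c) => c[0]);
}
```

An empty array means all six items pass; otherwise the returned items show where to look first.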

SpotInterruption: a Spot interruption is not a "failure"

Symptom

stopCode: "SpotInterruption"
stoppedReason: "Your Spot Task was interrupted."

The correct understanding

A Spot interruption is a designed behavior for AWS to reclaim capacity, not a software bug or an operational mistake. What you should address is "is it designed not to break when an interruption occurs."

How a Spot interruption works:

AWS decides to reclaim capacity and issues a warning 2 minutes before the interruption
EventBridge fires an ECS Task State Change event (stopCode=SpotInterruption)
SIGTERM is sent to the task (with the stopTimeout grace, max 120 seconds)
SIGKILL after the grace

The 3-piece set to absorb it by design:

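① Graceful shutdown: on SIGTERM, finish all in-flight work
② Idempotent processing: even if killed midway, no double processing after restart
③ Capacity provider strategy: protect the base with on-demand, assign the additional capacity to Spot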
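
Points ① and ② can be sketched together in a few lines. This is a deliberately simplified, synchronous illustration, not the implementation used on the payment platform: `DrainingWorker` is a hypothetical class, and the in-memory `processed` record stands in for a durable store (for example a unique key in a database), which is what actually survives a restart.

```typescript
type Outcome = "done" | "duplicate" | "rejected";

class DrainingWorker {
  private shuttingDown = false;
  private inFlight = 0;
  // Illustration only: a real system needs a durable idempotency store.
  private processed: Record<string, boolean> = {};

  // Call this from the SIGTERM handler: stop taking new work.
  beginShutdown(): void {
    this.shuttingDown = true;
  }

  // Runs the job at most once per jobId, and never after shutdown has begun.
  handle(jobId: string, work: () => void): Outcome {
    if (this.shuttingDown) return "rejected";
    if (this.processed[jobId]) return "duplicate";
    this.inFlight++;
    try {
      work();
      this.processed[jobId] = true; // mark only after the work succeeded
      return "done";
    } finally {
      this.inFlight--;
    }
  }

  // True once shutdown was requested and nothing is still running.
  isDrained(): boolean {
    return this.shuttingDown && this.inFlight === 0;
  }
}
```

The process would exit once isDrained() returns true, well inside stopTimeout. Rejected jobs stay in the queue and are picked up by another task, where the idempotency key prevents double processing.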

# Detect Spot interruptions with EventBridge and route them to alerts
resource "aws_cloudwatch_event_rule" "spot_interruption" {
  name        = "ecs-spot-interruption"
  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["ECS Task State Change"]
    detail = {
      stopCode = ["SpotInterruption"]
    }
  })
}

For the graceful-shutdown implementation pattern, see the "end cleanly on SIGTERM" section of the ECS on Fargate production-operations guide. stopTimeout's default is 30 seconds and the max is 120 seconds. In production, set it explicitly to match the app's drain time.

CannotStartContainerError / ContainerRuntimeTimeoutError

Symptom

CannotStartContainerError: ...
ContainerRuntimeTimeoutError: Timeout waiting for container to start

exitCode is null (hasn't reached container-process startup).

Cause and diagnosis

CannotStartContainerError:

A non-executable binary specified in command / entryPoint
Can't write to the volumes mount target (especially with readonlyRootFilesystem: true)
A UID specified by user doesn't exist in the container
A linuxParameters setting that Fargate does not support

# Reproduce the same settings locally
docker run --rm \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp \
  my-image:tag

ContainerRuntimeTimeoutError:

The container's startup sequence (the start of running ENTRYPOINT/CMD) timed out, for example because the wait for a connection to a dependency service is too long. If startup initialization waits on an external service, consider extending the health check's startPeriod or decoupling external dependencies from the startup sequence.
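
One way to decouple an external dependency from the startup sequence is to connect lazily, on first use, instead of blocking in the entrypoint. The following is a minimal synchronous sketch of that idea; `LazyDependency` is a hypothetical helper, and a real client would normally be asynchronous and add backoff between attempts.

```typescript
class LazyDependency<T> {
  private value: T | undefined;
  private ready = false;

  // Nothing is connected at construction time, so startup never waits here.
  constructor(private readonly connect: () => T) {}

  // Connects on first use. A failed attempt is not cached, so the next call retries.
  get(): T {
    if (!this.ready) {
      this.value = this.connect(); // throws on failure, leaving ready = false
      this.ready = true;
    }
    return this.value as T;
  }
}
```

With this shape the process can open its port and answer health checks immediately, and only the requests that actually need the dependency fail while it is unavailable.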

ECS Exec: investigate inside a running container

ECS Exec lets you enter a running container directly to investigate (including for debugging right after startup). No SSH, no open ports, and no key management are needed.

Prerequisites

① enableExecuteCommand: true on the ECS service / task

resource "aws_ecs_service" "app" {
  enable_execute_command = true
  # ...
}

Note: enableExecuteCommand is effective only for newly-started tasks. It can't be retrofitted to existing tasks. After changing the setting, swap tasks with force-new-deployment.

② The 4 ssmmessages actions on the task role

{
  "Effect": "Allow",
  "Action": [
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel"
  ],
  "Resource": "*"
}

This is a permission to put on the task role (taskRoleArn), not the execution role.

③ The SSM Session Manager plugin (client side)

# macOS
brew install session-manager-plugin

The actual commands

# Look up the task ID
TASK_ID=$(aws ecs list-tasks \
  --cluster prod \
  --service-name web-api \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

# Enter the container
aws ecs execute-command \
  --cluster prod \
  --task ${TASK_ID} \
  --container app \
  --interactive \
  --command "/bin/sh"

Once in the shell, you can directly check environment variables, file existence, network connectivity, and more.

# Check environment variables inside the container
env | grep DATABASE

# Check connectivity to the DB
nc -zv db.internal 5432

# Check processes
ps aux

Auditing ECS Exec

All ECS Exec operations are recorded in CloudTrail (as ExecuteCommand API calls). Furthermore, configuring logging: OVERRIDE lets you save the session's input/output to CloudWatch Logs or S3. For access to production containers, I recommend always enabling this audit log.

For a deeper observability implementation, see Observability with OpenTelemetry × ECS.

Prevention: stop recurrence with observability

Once you've solved a problem, put in structural prevention so you don't spend time on the same problem again.

1. Structured logs + correlation ID

Output all logs as JSON and always include requestId / traceId. You can then filter in CloudWatch Logs Insights.

// Minimal structured-logging implementation
const log = (level: string, msg: string, ctx: Record<string, unknown> = {}) => {
  process.stdout.write(
    JSON.stringify({ level, msg, timestamp: new Date().toISOString(), ...ctx }) + "\n"
  );
};

// Usage: pass requestId / traceId in ctx so that every log line carries it
log("info", "server:start", { port: 8080, env: process.env.NODE_ENV });
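
To carry the request ID through every log line without repeating it at each call site, one option is a small factory that binds context once per request. This is a sketch under assumptions: `formatLog` and `withContext` are hypothetical helpers, and the writer and clock are injected so the output can be tested (in the app they would be `process.stdout.write` and `new Date().toISOString()`).

```typescript
type Ctx = Record<string, unknown>;

// Pure formatter: one JSON object per line.
function formatLog(level: string, msg: string, timestamp: string, ctx: Ctx): string {
  return JSON.stringify({ level, msg, timestamp, ...ctx });
}

// Returns a logger whose every line includes the bound fields (e.g. requestId).
function withContext(bound: Ctx, write: (line: string) => void, now: () => string) {
  return (level: string, msg: string, ctx: Ctx = {}): void => {
    write(formatLog(level, msg, now(), { ...bound, ...ctx }));
  };
}
```

Creating one logger per request with `withContext({ requestId }, ...)` means a CloudWatch Logs Insights filter on requestId returns the whole request's trail.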

2. Container Insights + alerts

resource "aws_cloudwatch_metric_alarm" "task_stopped" {
  alarm_name          = "ecs-task-stopped-abnormally"
  # Alarm when the minimum running-task count within the period drops to 0
  comparison_operator = "LessThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "RunningTaskCount"
  namespace           = "ECS/ContainerInsights"
  period              = 60
  statistic           = "Minimum"
  threshold           = 0
  alarm_description   = "All tasks have stopped (even though desired > 0)"
  dimensions = {
    ClusterName = "prod"
    ServiceName = "web-api"
  }
}

3. Deployment circuit breaker

When new tasks consecutively fail the health check, automatically roll back to the previous revision.

resource "aws_ecs_service" "app" {
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

This alone structurally prevents the accident of "a misdeploy stays stuck in production." Integration with the CI/CD pipeline is detailed in the ECS on Fargate CI/CD guide.

4. Always set the health check's startPeriod

{
  "healthCheck": {
    "startPeriod": 30,
    "interval": 15,
    "timeout": 5,
    "retries": 3
  }
}

Without startPeriod the default is 0 seconds, so a task can be dropped for a temporary unhealthy state during startup.

5. Codify the network design with Terraform

A lack of VPC endpoints is the most common cause of CannotPullContainer / ResourceInitializationError. Manage the patterns from the ECS on Fargate networking guide with Terraform and eliminate config differences between environments.

Quick reference: stop reason → where to look first → typical cause → fix

Stop reason / stopCode	Where to look first	Typical cause	Fix
`CannotPullContainerError`	VPC endpoint list, execution role	Missing NAT/VPC endpoint in a private subnet, insufficient ECR permission on the execution role	Add ECR endpoints or check NAT, grant the policy
`ResourceInitializationError`	Execution role, VPC endpoints (secretsmanager/logs/ssm)	Secret unreachable, log-write impossible, ENI-allocation failure	Add permissions to the execution role, add VPC endpoints, check subnet IP exhaustion
`EssentialContainerExited` (exitCode≠0)	CloudWatch Logs	App-startup failure, missing env/secrets, wrong CMD	Check the error content in logs, fix env vars and command
`EssentialContainerExited` (exitCode=137)	Container Insights memory graph	OOMKill (memory-limit overrun)	Move to the next larger CPU/memory combination, investigate a leak
`EssentialContainerExited` (exitCode=null)	describe-tasks' containers[].reason	CannotStartContainer (pre-startup failure)	Check the CMD/ENTRYPOINT path and permissions
ELB health-check failure	ALB target-group config, task SG	target_type=instance, SG misconfig, insufficient startPeriod	Recreate the target group with `target_type=ip`, add SG allow, set startPeriod
`SpotInterruption`	The EventBridge event	Spot capacity reclamation (normal behavior)	Implement graceful shutdown, confirm idempotent design
`ContainerRuntimeTimeoutError`	The task definition's CMD/ENTRYPOINT	Startup processing too long, wait for a dependency service	Optimize the startup sequence, extend startPeriod
`CannotCreateVolumeError`	The task definition's volumes, EFS config	EFS mount target unreachable, SG config	Check the EFS endpoint and SG

Summary

ECS on Fargate troubleshooting is fastest with the pattern of logically narrowing down in the order stoppedReason → stopCode → exitCode → CloudWatch Logs, rather than starting from a feeling of "something's off."

From the experience I've accumulated in practice, the vast majority of "task won't start" problems consolidate into these 3.

Network reachability (a missing VPC endpoint or NAT in a private subnet)
Confusing the execution role and the task role (a permission on the wrong role)
Insufficient health-check grace (startPeriod and grace period not set)

The reason I could achieve zero double charges on the payment foundation is that I consistently had a design that crushes problems in advance with structure and observability, not "fix it after a failure occurs." If you want to stabilize a Fargate production environment fast and safely, feel free to reach out.

Related articles

AWS ECS on Fargate Production Operation Guide: Designing, Deploying, Costing, and Securing Serverless Containers in Real Code

ECS on Fargate Auto Scaling Complete Guide: Designing Target Tracking, Step, and the SQS Backlog Pattern at Production Quality

ECS on Fargate CI/CD Complete Guide: Shipping Safely with Native Blue/Green, CodeDeploy, and GitHub Actions (OIDC)

ECS on Fargate Cost-Optimization Complete Guide: From Understanding the Pricing Model to Graviton, Fargate Spot, and Savings Plans

Also worth reading

Azure Container Apps troubleshooting: diagnosing and fixing revision Failed/Degraded, exit code 137, probes, and image-pull failures

Azure Container Apps vs AWS ECS on Fargate: a thorough serverless-container comparison (scale-to-zero, GPU, cost, migration)

AWS ECS Fargate SRE Practical Guide: ADOT Distributed Tracing, EMF Metrics, and SLO / Error Budget / Burn-Rate Alert Design