In a production build of ECS on Fargate, the layer you get stuck on most is the networking layer. The task starts but won't connect to the ALB, the image can't be pulled from ECR, or service-to-service communication is unstable. At the root of these problems is usually a lack of understanding of awsvpc, the networking mode Fargate requires.
On a Minister of Economy, Trade and Industry Award-winning lumber-distribution B2B SaaS, I designed, implemented, and operated in production a configuration of API Gateway → NLB → ALB → ECS on Fargate (221 endpoints, private-subnet operation). In stabilizing this stack, networking design was the biggest differentiating factor.
This article shows, end-to-end, the design decisions and Terraform implementation for building at production quality, from the essence of awsvpc through ALB/NLB connection, security-group chaining, private-subnet design, VPC-endpoint isolation, to service-to-service communication with Service Connect. For the Fargate basics (task definitions, deployment, cost, security), read the ECS on Fargate production-operations guide first. This piece specializes in the networking layer.
The whole picture: where does a request pass through?
First, grasp the actual request path with an architecture diagram.
Internet
│
▼
┌─────────────────────────────────────────────┐
│ Public Subnet (ap-northeast-1a / 1c) │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ ALB │ │ NAT Gateway │ │
│ │ (SG: alb) │ │ (exit for the │ │
│ └──────┬───────┘ │ private │ │
│ │ │ subnet) │ │
└─────────│───────────└──────────────────┘───┘
│ ▲
│ target_type=ip │ outbound traffic
▼ │
┌─────────────────────────────────────────────┐
│ Private Subnet (ap-northeast-1a / 1c) │
│ ┌──────────────────────────────────────┐ │
│ │ ECS Task (ENI + Private IP) │ │
│ │ SG: task (from alb-sg:8080 only) │ │
│ │ ┌────────────┐ ┌────────────────┐ │ │
│ │ │ app:8080 │ │ sidecar(Envoy) │ │ │
│ │ └────────────┘ └────────────────┘ │ │
│ └──────────────────────────────────────┘ │
│ │
│ VPC Endpoints (Interface / Gateway) │
│ ecr.api / ecr.dkr / s3 / logs / │
│ secretsmanager / ssmmessages │
└─────────────────────────────────────────────┘
│
▼
AWS services (ECR / CloudWatch / Secrets Manager)
In the lumber-distribution SaaS, API Gateway → NLB is added even before this (see the lumber-industry-dx case study), separating the responsibilities of external publishing and internal L7 routing. This article targets the ECS networking layer from the ALB onward.
The essence of awsvpc: what does it mean for a task to have an ENI?
Fargate is fixed to networkMode: awsvpc. You can't choose the other networking modes (bridge, host). This is not a constraint but a structural consequence of the per-task isolation boundary Fargate provides.
Each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task. (AWS Fargate)
In awsvpc mode:
- A dedicated ENI (Elastic Network Interface) is assigned per task
- The ENI is given a private IP address within the subnet
- There is no host port mapping. Nothing like EC2's bridge mode, where the host's port 80 is forwarded to the container's port 8080, exists here
This consequence of "no host port mapping" directly bears on the ALB configuration.
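As a minimal sketch of what this looks like in a task definition: the port mapping declares only containerPort, because there is no host port to map to. The family name, CPU and memory sizes, and the two input variables are illustrative assumptions, not the configuration of the system described in this article.

```hcl
# Hypothetical inputs for this sketch.
variable "app_image" {
  type        = string
  description = "Container image URI, for example an ECR repository URI with a tag"
}

variable "execution_role_arn" {
  type        = string
  description = "Task execution role used to pull the image and write logs"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "web-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # the only mode Fargate accepts
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.app_image
      essential = true
      portMappings = [
        {
          # No hostPort: the container listens on the task ENI's private IP.
          containerPort = 8080
          protocol      = "tcp"
        }
      ]
    }
  ])
}
```

Whatever port the container listens on is the port the load balancer has to target, which is why the target group port and containerPort are kept identical throughout this article.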
Why the ALB must be target_type="ip"
When registering a normal EC2 instance as an ALB target, you use target_type="instance". This sends traffic by instance ID and the EC2-side port forwarding does the rest.
A Fargate task has no concept of a host. The task exists only as a private IP tied to an ENI. So the ALB must register targets by IP address directly. This is the reason for target_type="ip".
resource "aws_lb_target_group" "app" {
# ...
target_type = "ip" # Always this on Fargate; "instance" does not work
}
Leaving it as target_type="instance" makes ALB target registration fail, or the task starts but the target stays unhealthy. It's a representative mistake you get stuck on in production.
Load-balancer connection: how to choose ALB vs NLB
A Fargate service supports ALB, NLB, and GWLB. The practical decision criteria are as follows.
| Aspect | ALB (L7) | NLB (L4) |
|---|---|---|
| Protocol | HTTP/HTTPS | TCP / UDP / TLS |
| Routing | Path-based, host header, HTTP method | IP + port only |
| SSL termination | The ALB handles it | NLB pass-through or TLS termination |
| WebSocket | Supported | Supported (TCP) |
| UDP | Not supported | Supported (Fargate platform version 1.4.0+) |
| Static IP | Not supported (DNS only) | Supported (can assign EIP) |
| Typical use | REST API, web app | gRPC, games, IoT, internal NLB→ALB multi-tier |
In the lumber-distribution SaaS, I adopt a two-tier configuration of NLB → ALB → ECS. It's a pattern of assigning a static IP to the NLB to make it API Gateway's private integration endpoint, and doing HTTP routing on the ALB. For a service that's mainly a REST API, a single ALB tier is usually enough.
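The article does not show how the NLB forwards to the ALB. One way to wire the two tiers, sketched here under that assumption, is an NLB target group with target_type = "alb". The variable names and the choice of port 443 are hypothetical.

```hcl
# Hypothetical inputs for this sketch.
variable "vpc_id" { type = string }
variable "nlb_arn" { type = string }
variable "alb_arn" { type = string }

# NLB target group whose single target is the ALB itself.
resource "aws_lb_target_group" "nlb_to_alb" {
  name        = "nlb-to-alb"
  target_type = "alb"
  protocol    = "TCP"
  port        = 443 # the ALB needs a listener on this same port
  vpc_id      = var.vpc_id

  health_check {
    protocol = "HTTPS"
    path     = "/healthz"
  }
}

# Register the ALB. The ALB listener must already exist when this is applied.
resource "aws_lb_target_group_attachment" "alb" {
  target_group_arn = aws_lb_target_group.nlb_to_alb.arn
  target_id        = var.alb_arn
  port             = 443
}

# TCP listener on the NLB that passes connections through to the ALB.
resource "aws_lb_listener" "nlb_tcp" {
  load_balancer_arn = var.nlb_arn
  port              = 443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.nlb_to_alb.arn
  }
}
```

With this layout TLS still terminates on the ALB, so the NLB stays a pure L4 pass-through and all HTTP routing rules live in one place.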
Complete ALB Terraform: LB + target group + SG chain + service
# ── ALB itself ────────────────────────────────────────────────────────────
resource "aws_lb" "app" {
name = "prod-alb"
internal = false # Internet-facing. Set true for an internal ALB
load_balancer_type = "application"
subnets = var.public_subnet_ids
security_groups = [aws_security_group.alb.id]
# Ship access logs to S3 (a must in production)
access_logs {
bucket = var.alb_log_bucket
prefix = "alb/prod"
enabled = true
}
}
# ── ALB security group ────────────────────────────────────────────────────
resource "aws_security_group" "alb" {
name_prefix = "prod-alb-"
vpc_id = var.vpc_id
lifecycle { create_before_destroy = true }
}
# Accept HTTPS from the internet
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
security_group_id = aws_security_group.alb.id
ip_protocol = "tcp"
from_port = 443
to_port = 443
cidr_ipv4 = "0.0.0.0/0"
}
resource "aws_vpc_security_group_ingress_rule" "alb_https_v6" {
security_group_id = aws_security_group.alb.id
ip_protocol = "tcp"
from_port = 443
to_port = 443
cidr_ipv6 = "::/0"
}
# Outbound from the ALB to the tasks (pairs with the task SG)
resource "aws_vpc_security_group_egress_rule" "alb_to_tasks" {
security_group_id = aws_security_group.alb.id
referenced_security_group_id = aws_security_group.task.id
ip_protocol = "tcp"
from_port = 8080
to_port = 8080
}
# ── Task security group: accepts traffic only from the ALB's SG ───────────
resource "aws_security_group" "task" {
name_prefix = "prod-task-"
vpc_id = var.vpc_id
lifecycle { create_before_destroy = true }
}
# Inbound: app port only, from the ALB's SG. Do not open 0.0.0.0/0
resource "aws_vpc_security_group_ingress_rule" "task_from_alb" {
security_group_id = aws_security_group.task.id
referenced_security_group_id = aws_security_group.alb.id
ip_protocol = "tcp"
from_port = 8080
to_port = 8080
}
# Outbound: fully open (reaches ECR / CloudWatch / Secrets Manager via NAT)
# With VPC endpoints you may narrow this to HTTPS (443) only
resource "aws_vpc_security_group_egress_rule" "task_egress" {
security_group_id = aws_security_group.task.id
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
}
# ── Target group: on Fargate, always target_type = "ip" ───────────────────
resource "aws_lb_target_group" "app" {
name = "prod-app"
port = 8080
protocol = "HTTP"
vpc_id = var.vpc_id
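target_type = "ip" # ← The heart of Fargate networking. Never set "instance"
# Connection draining (how long to wait for in-flight requests when tasks are replaced)
# The default of 300 seconds is excessive. Shorten it in line with stopTimeout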
deregistration_delay = 60
health_check {
path = "/healthz"
protocol = "HTTP"
port = "traffic-port"
healthy_threshold = 2
unhealthy_threshold = 3
interval = 15
timeout = 5
matcher = "200"
}
}
# ── HTTPS listener ────────────────────────────────────────────────────────
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.app.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = var.acm_certificate_arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.app.arn
}
}
# HTTP → HTTPS redirect
resource "aws_lb_listener" "http_redirect" {
load_balancer_arn = aws_lb.app.arn
port = 80
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
# ── ECS service ───────────────────────────────────────────────────────────
resource "aws_ecs_service" "app" {
name = "web-api"
cluster = var.cluster_id
task_definition = var.task_definition_arn
desired_count = 2
launch_type = "FARGATE"
platform_version = "LATEST"
# Keep 2 tasks healthy throughout a deploy; start up to 4 tasks before stopping the old version
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
deployment_circuit_breaker {
enable = true
rollback = true
}
# Place tasks in private subnets. No public IP needed
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.task.id]
assign_public_ip = false # false for private subnet + NAT
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = "app"
container_port = 8080
}
# Grace period that prevents false health-check failures right after deployment
# Set it together with the container's startPeriod
health_check_grace_period_seconds = 30
enable_execute_command = true # Break-glass access via ECS Exec
}
The roles of health_check_grace_period_seconds and deregistration_delay
These two are often confused, but they're in different phases.
- health_check_grace_period_seconds: the period right after a task starts during which the ECS service scheduler ignores failing load-balancer health checks. The ALB still runs its checks, but ECS does not act on an unhealthy result, which prevents the task from being judged unhealthy and force-killed before the app's initialization (establishing DB connections, cache warm-up) completes.
- deregistration_delay: the time to wait for in-flight processing when deregistering a task from the ALB target group. It applies on every deploy, scale-in, and task stop. The default is 300 seconds, but since most APIs finish in seconds, that is excessive. Aligning it with stopTimeout at 30–60 seconds is realistic.
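The container-side counterparts of these two settings live in the task definition. The following is a sketch only: the resource name, the variables, the specific values, and the assumption that curl is available inside the image are all illustrative.

```hcl
# Hypothetical inputs for this sketch.
variable "app_image" { type = string }
variable "execution_role_arn" { type = string }

resource "aws_ecs_task_definition" "app_timeouts" {
  family                   = "web-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([
    {
      name         = "app"
      image        = var.app_image
      essential    = true
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]

      # Seconds between SIGTERM and SIGKILL. Fargate allows at most 120.
      stopTimeout = 60

      healthCheck = {
        # Assumes curl exists in the image; swap in whatever probe the image ships.
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"]
        interval    = 15
        timeout     = 5
        retries     = 3
        startPeriod = 30 # failures during this window do not count toward retries
      }
    }
  ])
}
```

Here stopTimeout is set to the same 60 seconds as deregistration_delay in the target group above, and startPeriod to the same 30 seconds as health_check_grace_period_seconds, so the load-balancer view and the container view of "starting" and "stopping" agree.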
Security-group chaining: a design that doesn't open 0.0.0.0/0
The most important security design in Fargate networking is security-group chaining.
[Internet]
│ TCP 443
▼
[alb-sg] ── egress: only task-sg:8080
│ TCP 8080
▼
[task-sg] ── ingress: allow only from alb-sg
│
[task ENI:8080]
The point is to use SG references (referenced_security_group_id) rather than CIDR ranges. A CIDR-based allow leaves gaps whenever IPs change, whereas an SG reference allows traffic from whatever resources have that SG attached, so it automatically follows the ALB as it scales out and gains IPs.
Never open 0.0.0.0/0 on the inbound side of the task's SG. If you do, then even in a private subnet any other resource in the VPC can reach the task directly, bypassing the ALB.
The same principle applies when there are multiple services communicating with each other. Chain "service A's SG → service B's SG (app port only)." Using Service Connect described later, this SG design can be further organized.
Subnet design: private + NAT vs public direct placement
Private subnet + NAT Gateway (the production standard)
The production standard is the pattern of placing tasks in a private subnet and routing outbound traffic via a NAT Gateway.
Private subnet
└── ECS task (assign_public_ip = false)
└── → NAT Gateway (public subnet)
└── → Internet (ECR / CloudWatch / Secrets Manager)
The benefit is minimizing the attack surface. Because the task has no public IP, it can't be reached directly from the internet. Only requests via the ALB arrive.
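The routing behind that diagram can be sketched as one NAT Gateway and one private route table per AZ. The two map variables (AZ name to subnet ID) are hypothetical and must share the same keys. The `domain` argument on aws_eip assumes AWS provider v5 or later.

```hcl
# Hypothetical inputs: AZ name => subnet ID, with identical keys in both maps.
variable "vpc_id" { type = string }
variable "public_subnet_id_by_az" { type = map(string) }
variable "private_subnet_id_by_az" { type = map(string) }

# One Elastic IP and one NAT Gateway per AZ, placed in that AZ's public subnet.
resource "aws_eip" "nat" {
  for_each = var.public_subnet_id_by_az
  domain   = "vpc"
}

resource "aws_nat_gateway" "this" {
  for_each      = var.public_subnet_id_by_az
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value
}

# One private route table per AZ, defaulting to the NAT Gateway in the same AZ.
resource "aws_route_table" "private" {
  for_each = var.private_subnet_id_by_az
  vpc_id   = var.vpc_id
}

resource "aws_route" "private_default" {
  for_each               = var.private_subnet_id_by_az
  route_table_id         = aws_route_table.private[each.key].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this[each.key].id
}

resource "aws_route_table_association" "private" {
  for_each       = var.private_subnet_id_by_az
  subnet_id      = each.value
  route_table_id = aws_route_table.private[each.key].id
}
```

Keeping each private subnet on the NAT Gateway of its own AZ means the loss of one AZ does not take outbound connectivity away from tasks in the other.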
Public direct placement (assign_public_ip = true)
In dev environments or prototypes, there's also the choice of placing tasks in a public subnet with assign_public_ip = true. Because the task gets a public IP, it can reach ECR and CloudWatch without a NAT. There's no NAT Gateway cost (data processing fee + hourly charge).
But it's not recommended for production. The task's public IP is directly exposed to the internet, so a single SG misconfiguration immediately becomes an exposure. A public IP also only helps when the task sits in a public subnet (one routed to an internet gateway), so tasks end up sharing subnets with internet-facing resources and the VPC design becomes more complex.
The decision guideline: for production, private + NAT is the only choice. If the NAT cost worries you, reduce it with the VPC endpoints described later.
VPC endpoints: reduce the NAT, or isolate
When a Fargate task in a private subnet pulls an image from ECR or sends logs to CloudWatch, by default it communicates with public endpoints via the NAT Gateway. Set up VPC endpoints and you can close this communication within the VPC.
The list of VPC endpoints Fargate needs
| Endpoint | Type | Use | Necessity |
|---|---|---|---|
| com.amazonaws.<region>.ecr.api | Interface | ECR API (fetching image metadata) | Required when using ECR |
| com.amazonaws.<region>.ecr.dkr | Interface | Pulling image layers from ECR (Docker Registry API) | Required when using ECR |
| com.amazonaws.<region>.s3 | Gateway | ECR image layers are stored in S3. ecr.dkr alone is insufficient | Required when using ECR |
| com.amazonaws.<region>.logs | Interface | Sending to CloudWatch Logs via the awslogs driver | Required when using CloudWatch Logs |
| com.amazonaws.<region>.secretsmanager | Interface | Secrets Manager injection via secrets[].valueFrom | Required when injecting secrets |
| com.amazonaws.<region>.ssmmessages | Interface | ECS Exec (via SSM Session Manager) | Required when ECS Exec is enabled |
Caveat: if you set up only the ECR endpoints (ecr.api, ecr.dkr) and omit the S3 gateway endpoint, image-layer pulls go via the NAT. Because ECR image layers are stored in S3, the S3 gateway is required as a set with the ECR endpoints.
The Fargate task's own ECS control-plane communication (ecs, ecs-agent, ecs-telemetry) works without endpoints, but ECR, Secrets Manager, and CloudWatch Logs traffic flows to public endpoints unless you set them up.
Terraform: the full set of VPC endpoints
# ── SG for Interface endpoints ────────────────────────────────────────────
resource "aws_security_group" "vpc_endpoints" {
name_prefix = "vpc-endpoints-"
vpc_id = var.vpc_id
lifecycle { create_before_destroy = true }
}
# Accept HTTPS only from within the VPC CIDR (narrow to the private subnet CIDRs for a tighter scope)
resource "aws_vpc_security_group_ingress_rule" "endpoint_https" {
security_group_id = aws_security_group.vpc_endpoints.id
ip_protocol = "tcp"
from_port = 443
to_port = 443
cidr_ipv4 = var.vpc_cidr
}
resource "aws_vpc_security_group_egress_rule" "endpoint_egress" {
security_group_id = aws_security_group.vpc_endpoints.id
ip_protocol = "-1"
cidr_ipv4 = "0.0.0.0/0"
}
# ── S3 gateway endpoint (required for ECR layer pulls) ────────────────────
resource "aws_vpc_endpoint" "s3" {
vpc_id = var.vpc_id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = var.private_route_table_ids
}
# ── Interface endpoints ───────────────────────────────────────────────────
locals {
interface_endpoints = [
"ecr.api",
"ecr.dkr",
"logs",
"secretsmanager",
"ssmmessages",
]
}
resource "aws_vpc_endpoint" "interface" {
for_each = toset(local.interface_endpoints)
vpc_id = var.vpc_id
service_name = "com.amazonaws.${var.region}.${each.value}"
vpc_endpoint_type = "Interface"
subnet_ids = var.private_subnet_ids
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true # Keep DNS resolution inside the VPC
}
private_dns_enabled = true is important. This resolves public DNS names like api.ecr.ap-northeast-1.amazonaws.com to the endpoint ENI IPs inside the VPC, so you can isolate the traffic without changing the app or task definition at all.
Does the NAT Gateway become unnecessary?
Even setting up all the VPC endpoints, you can't necessarily abolish the NAT Gateway completely. If the task calls non-AWS external APIs (Stripe, SendGrid, etc.), you still need the NAT. If your use case is "purely AWS services only," a NAT-less isolated configuration is possible, but it's a rare case.
The realistic decision: the production standard is to keep the NAT Gateway while dropping ECR, CloudWatch, and Secrets Manager traffic internally with VPC endpoints, reducing the NAT's data-processing cost and bandwidth dependence.
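For the rare NAT-less case, the task SG's all-open egress rule can be swapped for rules that reach only the endpoints. This is a sketch under stated assumptions: the three variables are hypothetical, and any other destination the task talks to (a database, another service) would need its own rule.

```hcl
# Hypothetical inputs for this sketch.
variable "region" { type = string }
variable "task_security_group_id" { type = string }
variable "endpoint_security_group_id" { type = string }

# AWS-managed prefix list holding the S3 address ranges for the region.
data "aws_ec2_managed_prefix_list" "s3" {
  name = "com.amazonaws.${var.region}.s3"
}

# HTTPS to the Interface endpoints, by SG reference rather than CIDR.
resource "aws_vpc_security_group_egress_rule" "task_to_endpoints" {
  security_group_id            = var.task_security_group_id
  referenced_security_group_id = var.endpoint_security_group_id
  ip_protocol                  = "tcp"
  from_port                    = 443
  to_port                      = 443
}

# HTTPS to S3 through the gateway endpoint (needed for ECR layer downloads).
resource "aws_vpc_security_group_egress_rule" "task_to_s3" {
  security_group_id = var.task_security_group_id
  prefix_list_id    = data.aws_ec2_managed_prefix_list.s3.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}
```

The gateway endpoint has no ENI and therefore no security group to reference, which is why S3 is allowed through its prefix list instead.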
Service-to-service communication: ECS Service Connect (recommended)
In a microservices configuration, you need internal communication where service A calls service B. There are several choices.
| Method | Mechanism | Pros | Cons |
|---|---|---|---|
| Internal ALB | Stand up an internal ALB for each service | Can use the ALB's routing features | More resources, cost, management complexity |
| Cloud Map (DNS) | DNS resolution with a Route 53 private hosted zone | Simple | No retries, timeouts, or metrics |
| Service Connect | Inject Envoy as a sidecar, resolve by logical name | DNS + automatic retries + metrics. ECS-managed | Only ECS services joined to the same namespace can use it |
Service Connect is recommended. It achieves loose coupling without adding internal ALBs, and Envoy automatically collects connection metrics and publishes them to CloudWatch in the AWS/ECS namespace, which also improves observability.
How Service Connect works
Amazon ECS Service Connect provides management of service-to-service communication as Amazon ECS configuration. It builds both service discovery and a service mesh in Amazon ECS. (ECS Service Connect)
Service Connect has ECS automatically inject an Envoy proxy as a sidecar (you don't need to explicitly add it to the container definition). The client-side task can call by logical name (http://order-api:8080/orders), and Envoy handles name resolution, load balancing, retries, and timeouts.
Configuration in Terraform + task definition
Service Connect is configured in the ECS service's service_connect_configuration block.
# ── Service A (order-api) publishes itself through Service Connect ────────
resource "aws_ecs_service" "order_api" {
name = "order-api"
cluster = var.cluster_id
task_definition = var.order_api_task_definition_arn
desired_count = 2
launch_type = "FARGATE"
platform_version = "LATEST"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.task.id]
assign_public_ip = false
}
service_connect_configuration {
enabled = true
namespace = var.cloud_map_namespace_arn # Created beforehand with aws_service_discovery_http_namespace
# Endpoint that this service publishes
service {
port_name = "http" # Must match portMappings[].name in the task definition
discovery_name = "order-api" # Logical name used as the DNS name
client_alias {
port = 8080
dns_name = "order-api" # Other tasks reach the service at this name
}
}
}
}
# ── Service B (inventory-api), the side that calls order-api ──────────────
resource "aws_ecs_service" "inventory_api" {
name = "inventory-api"
cluster = var.cluster_id
task_definition = var.inventory_api_task_definition_arn
desired_count = 2
launch_type = "FARGATE"
platform_version = "LATEST"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.task.id]
assign_public_ip = false
}
service_connect_configuration {
enabled = true
namespace = var.cloud_map_namespace_arn
# On the client side, omit the service block (when not publishing anything)
# The auto-injected Envoy mediates traffic to order-api:8080
}
}
On the task-definition side, you need to give portMappings a name.
{
"containerDefinitions": [
{
"name": "app",
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp",
"name": "http",
"appProtocol": "http"
}
]
}
]
}
With this, when you throw an HTTP request to http://order-api:8080/orders from inside inventory-api, Envoy acts as an interceptor and automatically handles connection pooling, retries, and timeouts.
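The namespace that both services reference has to exist before they are created. A minimal sketch, with the namespace name chosen purely for illustration:

```hcl
# HTTP namespace used by Service Connect. The name "internal" is illustrative.
resource "aws_service_discovery_http_namespace" "internal" {
  name        = "internal"
  description = "Service Connect namespace for service-to-service calls"
}

# Expose the ARN so it can be passed in as var.cloud_map_namespace_arn.
output "cloud_map_namespace_arn" {
  value = aws_service_discovery_http_namespace.internal.arn
}
```

Creating the namespace in its own module or stack keeps it from being destroyed together with any single service that uses it.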
Organize service-to-service SGs with Service Connect
Even when using Service Connect, SG design is still needed. When services communicate within the same cluster, allow the relevant port from the sending task's SG to the receiving task's SG.
# Allow traffic from inventory-api to order-api
resource "aws_vpc_security_group_ingress_rule" "order_from_inventory" {
security_group_id = aws_security_group.order_task.id
referenced_security_group_id = aws_security_group.inventory_task.id
ip_protocol = "tcp"
from_port = 8080
to_port = 8080
}
When to use Cloud Map instead
Cloud Map is simple DNS-based service discovery using a Route 53 private hosted zone. ECS updates the A records each time a task in the service starts or stops.
- If simple DNS resolution alone is enough with no need for retries, timeouts, or metrics, Cloud Map suffices
- If you want connection observability, circuit breakers, or fine-grained timeout control, choose Service Connect
For a new build, I recommend Service Connect. It has more configuration than Cloud Map, but Envoy's metrics (connection count, error rate, latency) are directly usable for production observation.
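For comparison, the Cloud Map route looks roughly like the following sketch. The domain internal.local, the resource names, and all variables are hypothetical.

```hcl
# Hypothetical inputs for this sketch.
variable "vpc_id" { type = string }
variable "cluster_id" { type = string }
variable "task_definition_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "task_security_group_id" { type = string }

# Private DNS namespace. Cloud Map creates a Route 53 private hosted zone for it.
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "internal.local" # illustrative domain
  vpc  = var.vpc_id
}

resource "aws_service_discovery_service" "order_api" {
  name = "order-api"

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.internal.id
    routing_policy = "MULTIVALUE"

    dns_records {
      type = "A" # awsvpc tasks register the private IP of their ENI
      ttl  = 10
    }
  }
}

resource "aws_ecs_service" "order_api_cloud_map" {
  name            = "order-api"
  cluster         = var.cluster_id
  task_definition = var.task_definition_arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [var.task_security_group_id]
    assign_public_ip = false
  }

  # ECS registers and deregisters task IPs in Cloud Map as tasks come and go.
  service_registries {
    registry_arn = aws_service_discovery_service.order_api.arn
  }
}
```

Callers would then use http://order-api.internal.local:8080, with retries and timeouts left entirely to the client code, which is the trade-off described above.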
Caveats when using an NLB
When you need L4 communication (gRPC, WebSocket over TCP, UDP) or a static IP, use an NLB. Let me summarize the Fargate + NLB-specific caveats.
resource "aws_lb" "internal" {
name = "prod-nlb"
internal = true
load_balancer_type = "network"
subnets = var.private_subnet_ids
}
resource "aws_lb_target_group" "nlb_app" {
name = "nlb-app"
port = 8080
protocol = "TCP"
vpc_id = var.vpc_id
target_type = "ip" # Fargate requires ip on an NLB as well
deregistration_delay = 30
health_check {
protocol = "HTTP"
path = "/healthz"
healthy_threshold = 2
unhealthy_threshold = 2
interval = 10
}
}
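When using UDP, Fargate platform version 1.4.0 or later is required (platform_version = "LATEST" currently resolves to 1.4.0 on Linux). Earlier platform versions don't support UDP.
Also pay attention to where NLB traffic appears to come from. With target_type = "ip", client IP preservation is disabled by default for TCP and TLS target groups, so the task sees the NLB nodes' private IPs as the source, and health checks come from those same addresses. If the NLB has no security group attached, as in the example above, there is nothing to reference, so allow the CIDR of the NLB's subnets on the task's SG. If you enable client IP preservation, the source becomes the original client IP and you must allow the client CIDRs as well. An NLB created with a security group attached can be referenced from the task's SG just like an ALB.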
# CIDR-based allow for traffic arriving through the NLB (this NLB has no SG to reference)
resource "aws_vpc_security_group_ingress_rule" "task_from_nlb_cidr" {
security_group_id = aws_security_group.task.id
ip_protocol = "tcp"
from_port = 8080
to_port = 8080
cidr_ipv4 = var.vpc_cidr # Ideally the CIDR of the subnets hosting the NLB; the VPC CIDR is a broader stand-in
}
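An NLB that is created with a security group attached can be chained by SG reference instead of by CIDR. The sketch below assumes client IP preservation is left at its default (disabled) for this TCP target group. The resource names, the variables, and the listener port 8080 are hypothetical.

```hcl
# Hypothetical inputs for this sketch.
variable "vpc_id" { type = string }
variable "vpc_cidr" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "task_security_group_id" { type = string }

resource "aws_security_group" "nlb" {
  name_prefix = "prod-nlb-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

# Security groups can only be attached to an NLB at creation time.
resource "aws_lb" "internal_with_sg" {
  name               = "prod-nlb-sg"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.private_subnet_ids
  security_groups    = [aws_security_group.nlb.id]
}

# Callers inside the VPC reach the NLB listener (port is illustrative).
resource "aws_vpc_security_group_ingress_rule" "nlb_from_vpc" {
  security_group_id = aws_security_group.nlb.id
  ip_protocol       = "tcp"
  from_port         = 8080
  to_port           = 8080
  cidr_ipv4         = var.vpc_cidr
}

# NLB → tasks, and tasks ← NLB, both expressed as SG references.
resource "aws_vpc_security_group_egress_rule" "nlb_to_tasks" {
  security_group_id            = aws_security_group.nlb.id
  referenced_security_group_id = var.task_security_group_id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

resource "aws_vpc_security_group_ingress_rule" "task_from_nlb_sg" {
  security_group_id            = var.task_security_group_id
  referenced_security_group_id = aws_security_group.nlb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}
```

This keeps the task SG free of CIDR rules, at the cost of having to recreate an existing NLB that was launched without a security group.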
AWS WAF cannot be attached to an NLB, so filtering traffic with a WAF requires pairing the NLB with an ALB. For details, see the WAF defense-in-depth guide.
Design checklist
Items to always confirm before releasing to production.
awsvpc / ALB basics
- Is the task definition's networkMode awsvpc (fixed for Fargate, but confirm explicitly)
- Is the ALB/NLB target group's target_type = "ip"
- Do the ALB and Fargate task have separate SGs, chained with SG references
- Is there no 0.0.0.0/0 on the task's SG inbound
- Is health_check_grace_period_seconds set, accounting for the app's initialization time
- Is deregistration_delay shortened to align with stopTimeout (the default 300 seconds is excessive)
Subnet / routing
- Are tasks placed in a private subnet with assign_public_ip = false
- Does a NAT Gateway exist in each AZ (a single-AZ NAT is a SPOF)
- Does the private route table have a route to the NAT Gateway
VPC endpoints
- If using ECR, is the three-piece set of ecr.api, ecr.dkr, and the S3 gateway in place
- If using CloudWatch Logs, is there a logs endpoint
- If using secrets[].valueFrom, is there a secretsmanager endpoint
- If enabling ECS Exec, is there an ssmmessages endpoint
- Does the Interface endpoints' SG allow TCP 443 from the private subnet CIDR
- Is private_dns_enabled = true
Service Connect / service-to-service
- If there are multiple services, are they resolved with Service Connect or Cloud Map without adding internal ALBs
- If using Service Connect, is a Cloud Map namespace (HTTP namespace) created
- Are name and appProtocol set on the task definition's portMappings
- Is mutual communication allowed between services' SGs (chained with SG references)
Summary
Fargate's networking unfolds entirely from the single point of awsvpc.
- awsvpc + target_type=ip is the absolute rule to grasp first
- With SG chaining (ALB SG → task SG), never open 0.0.0.0/0 to the task
- Private subnet + NAT is the safe side for production. Isolate communication to major AWS services with VPC endpoints to cut cost and attack surface
- With Service Connect, make service-to-service communication loosely coupled and prevent the proliferation of internal ALBs
The reason I can stably operate a 221-endpoint lumber-distribution SaaS with API Gateway → NLB → ALB → ECS on Fargate is that I'm thorough with this design at each layer. Build the networking layer correctly, and you can clearly separate app-layer problems from infra-layer problems, and the speed of troubleshooting goes up too.
For cost optimization (Spot, Graviton, Savings Plans), see the ECS on Fargate cost-optimization guide; for investigating the cause when a task stops, the ECS on Fargate troubleshooting guide. The full configuration of this award-winning SaaS on this portfolio is introduced in detail in the lumber-industry-dx case study. If you'd like to move forward together on designing and building a Fargate production foundation, please reach out from there.