ECS on Fargate Networking Design Complete Guide: Building awsvpc, ALB/NLB, Service Connect, and VPC Endpoints at Production Quality

In a production build of ECS on Fargate, the layer you get stuck on most is networking. The task starts but won't connect to the ALB, it can't pull the image from ECR, or service-to-service communication is unstable. At the root of these problems is usually a shallow understanding of awsvpc, the networking mode Fargate mandates.

On a Minister of Economy, Trade and Industry Award-winning lumber-distribution B2B SaaS, I designed, implemented, and operated in production a configuration of API Gateway → NLB → ALB → ECS on Fargate (221 endpoints, private-subnet operation). In stabilizing this stack, networking design was the biggest differentiating factor.

This article walks end to end through the design decisions and Terraform implementation needed for a production-quality build: the essence of awsvpc, ALB/NLB connection, security-group chaining, private-subnet design, VPC endpoints, and service-to-service communication with Service Connect. For the Fargate basics (task definitions, deployment, cost, security), read the ECS on Fargate production-operations guide first. This piece specializes in the networking layer.

The whole picture: where does a request pass through?

First, grasp the actual request path with an architecture diagram.

Internet
      │
      ▼
┌─────────────────────────────────────────────┐
│  Public Subnet (ap-northeast-1a / 1c)       │
│  ┌──────────────┐    ┌──────────────────┐   │
│  │  ALB         │    │  NAT Gateway     │   │
│  │  (SG: alb)   │    │  (exit for the   │   │
│  └──────┬───────┘    │   private        │   │
│         │            │   subnet)        │   │
└─────────│────────────└──────────────────┘───┘
          │                    ▲
          │ target_type=ip     │ outbound traffic
          ▼                    │
┌─────────────────────────────────────────────┐
│  Private Subnet (ap-northeast-1a / 1c)      │
│  ┌──────────────────────────────────────┐   │
│  │  ECS Task (ENI + Private IP)         │   │
│  │  SG: task (from alb-sg:8080 only)    │   │
│  │  ┌────────────┐  ┌────────────────┐  │   │
│  │  │  app:8080  │  │ sidecar(Envoy) │  │   │
│  │  └────────────┘  └────────────────┘  │   │
│  └──────────────────────────────────────┘   │
│                                             │
│  VPC Endpoints (Interface / Gateway)        │
│  ecr.api / ecr.dkr / s3 / logs /            │
│  secretsmanager / ssmmessages               │
└─────────────────────────────────────────────┘
          │
          ▼
  AWS services (ECR / CloudWatch / Secrets Manager)

In the lumber-distribution SaaS, API Gateway → NLB sits in front of this path (see the lumber-industry-dx case study), separating external publishing from internal L7 routing. This article covers the ECS networking layer from the ALB onward.

The essence of awsvpc: what does it mean for a task to have an ENI?

Fargate is fixed to networkMode: awsvpc. You can't choose the other networking modes (bridge, host). This is not a constraint but a structural consequence of the per-task isolation boundary Fargate provides.

"Each Fargate task has its own isolation boundary and does not share the underlying kernel, CPU resources, memory resources, or elastic network interface with another task." (Source: AWS Fargate)

In awsvpc mode:

A dedicated ENI (Elastic Network Interface) is assigned per task
The ENI is given a private IP address within the subnet
There is no host port mapping. There's no concept like EC2's bridge mode of "forward the host's port 80 to the container's port 8080"

This consequence of "no host port mapping" directly bears on the ALB configuration.
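To make these properties concrete, here is a minimal sketch of a Fargate task definition in Terraform. It is an illustration under assumptions, not the definition used in the SaaS: `var.app_image` and `var.execution_role_arn` are hypothetical variables, and the CPU and memory sizes are arbitrary.

```hcl
# Minimal Fargate task definition sketch (names and sizes are illustrative).
resource "aws_ecs_task_definition" "app" {
  family                   = "web-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # the only mode Fargate accepts
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn # hypothetical variable

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.app_image # hypothetical variable
      essential = true
      portMappings = [
        {
          # Only containerPort is given: in awsvpc mode there is no separate
          # host port to map, the task's ENI listens on 8080 directly.
          containerPort = 8080
          protocol      = "tcp"
        }
      ]
    }
  ])
}
```

Note the absence of a `hostPort`: the container port is reachable on the task's own private IP, which is exactly what the load balancer registers.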

Why the ALB must be target_type="ip"

When registering a normal EC2 instance as an ALB target, you use target_type="instance". This sends traffic by instance ID and the EC2-side port forwarding does the rest.

A Fargate task has no concept of a host. The task exists only as a private IP tied to an ENI. So the ALB must register targets by IP address directly. This is the reason for target_type="ip".

resource "aws_lb_target_group" "app" {
  # ...
  target_type = "ip"   # Always use this on Fargate; "instance" does not work
}

If you leave it as target_type="instance", target registration fails, or the task starts but the target never becomes healthy. It's a classic mistake to get stuck on in production.

Load-balancer connection: how to choose ALB vs NLB

A Fargate service supports ALB, NLB, and GWLB. The practical decision criteria are as follows.

Aspect	ALB (L7)	NLB (L4)
Protocol	HTTP/HTTPS	TCP / UDP / TLS
Routing	Path-based, host header, HTTP method	IP + port only
SSL termination	The ALB handles it	NLB pass-through or TLS termination
WebSocket	Supported	Supported (TCP)
UDP	Not supported	Supported (Fargate platform version 1.4.0+)
Static IP	Not supported (DNS only)	Supported (can assign EIP)
Typical use	REST API, web app	gRPC, games, IoT, internal NLB→ALB multi-tier

In the lumber-distribution SaaS, I adopt a two-tier configuration of NLB → ALB → ECS. It's a pattern of assigning a static IP to the NLB to make it API Gateway's private integration endpoint, and doing HTTP routing on the ALB. For a service that's mainly a REST API, a single ALB tier is usually enough.
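The article does not show how the NLB hands traffic to the ALB. One possible wiring, offered as a sketch and not necessarily the configuration used in the SaaS, is an ALB-type target group behind the NLB plus an API Gateway VPC link. It assumes the `aws_lb.internal` (NLB), `aws_lb.app` (ALB), and `aws_lb_listener.https` resources defined elsewhere in this article, and that in a two-tier setup the ALB would be internal rather than internet-facing; the resource names below are hypothetical.

```hcl
# NLB target group whose single target is the ALB itself.
resource "aws_lb_target_group" "nlb_to_alb" {
  name        = "nlb-to-alb"
  target_type = "alb"
  port        = 443 # must match the ALB listener port
  protocol    = "TCP"
  vpc_id      = var.vpc_id

  health_check {
    protocol = "HTTPS"
    path     = "/healthz"
  }
}

resource "aws_lb_target_group_attachment" "alb" {
  target_group_arn = aws_lb_target_group.nlb_to_alb.arn
  target_id        = aws_lb.app.arn
  port             = 443

  # The ALB needs a listener on this port before it can be registered.
  depends_on = [aws_lb_listener.https]
}

resource "aws_lb_listener" "nlb_to_alb" {
  load_balancer_arn = aws_lb.internal.arn
  port              = 443
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.nlb_to_alb.arn
  }
}

# VPC link that lets an API Gateway REST API reach the internal NLB.
resource "aws_api_gateway_vpc_link" "nlb" {
  name        = "prod-nlb-link"
  target_arns = [aws_lb.internal.arn]
}
```

With this shape the NLB only provides the stable L4 entry point, and all HTTP routing decisions stay on the ALB.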

Complete ALB Terraform: LB + target group + SG chain + service

# ── ALB ───────────────────────────────────────────────────────────────────

resource "aws_lb" "app" {
  name               = "prod-alb"
  internal           = false   # Internet-facing. Set true for an internal ALB
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [aws_security_group.alb.id]

  # Ship access logs to S3 (mandatory in production)
  access_logs {
    bucket  = var.alb_log_bucket
    prefix  = "alb/prod"
    enabled = true
  }
}

# ── ALB security group ────────────────────────────────────────────────────

resource "aws_security_group" "alb" {
  name_prefix = "prod-alb-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

# Accept HTTPS from the internet
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv4         = "0.0.0.0/0"
}

resource "aws_vpc_security_group_ingress_rule" "alb_https_v6" {
  security_group_id = aws_security_group.alb.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv6         = "::/0"
}

# Accept HTTP too, otherwise the port-80 redirect listener is unreachable
resource "aws_vpc_security_group_ingress_rule" "alb_http" {
  security_group_id = aws_security_group.alb.id
  ip_protocol       = "tcp"
  from_port         = 80
  to_port           = 80
  cidr_ipv4         = "0.0.0.0/0"
}

# Outbound from the ALB to the tasks (pairs with the task SG's ingress rule)
resource "aws_vpc_security_group_egress_rule" "alb_to_tasks" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.task.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# ── Task security group: accept traffic only from the ALB's SG ────────────

resource "aws_security_group" "task" {
  name_prefix = "prod-task-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

# Inbound: only the app port, only from the ALB's SG. Never open 0.0.0.0/0
resource "aws_vpc_security_group_ingress_rule" "task_from_alb" {
  security_group_id            = aws_security_group.task.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# Outbound: fully open (reaches ECR / CloudWatch / Secrets Manager via the NAT)
# If you use VPC endpoints, you can narrow this to HTTPS (443) only
resource "aws_vpc_security_group_egress_rule" "task_egress" {
  security_group_id = aws_security_group.task.id
  ip_protocol       = "-1"
  cidr_ipv4         = "0.0.0.0/0"
}

# ── Target group: Fargate always needs target_type = "ip" ─────────────────

resource "aws_lb_target_group" "app" {
  name        = "prod-app"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"   # The core of Fargate networking. Never set "instance"

  # Connection draining (how long to wait for in-flight requests when a task is replaced)
  # The default of 300 seconds is excessive. Shorten it in line with stopTimeout
  deregistration_delay = 60

  health_check {
    path                = "/healthz"
    protocol            = "HTTP"
    port                = "traffic-port"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
    matcher             = "200"
  }
}

# ── HTTPS listener ────────────────────────────────────────────────────────

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.acm_certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# HTTP → HTTPS redirect
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# ── ECS service ───────────────────────────────────────────────────────────

resource "aws_ecs_service" "app" {
  name             = "web-api"
  cluster          = var.cluster_id
  task_definition  = var.task_definition_arn
  desired_count    = 2
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  # During a deploy, keep 2 tasks healthy at all times and start up to 4 before stopping the old version
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Place tasks in private subnets. No public IP is needed
  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.task.id]
    assign_public_ip = false   # false for private subnet + NAT
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  # Grace period that prevents health-check false positives right after a deploy
  # Set it together with the container's startPeriod
  health_check_grace_period_seconds = 30

  enable_execute_command = true   # Break-glass access via ECS Exec
}

The roles of `health_check_grace_period_seconds` and `deregistration_delay`

These two are often confused, but they're in different phases.

health_check_grace_period_seconds: the period after a task starts during which the ECS service scheduler ignores failing load-balancer health checks. The ALB still runs its checks, but ECS does not act on the results, which prevents a task from being judged unhealthy and replaced before the app's initialization (establishing DB connections, cache warm-up) completes.
deregistration_delay: the time the load balancer waits for in-flight requests to finish after a target starts deregistering. It applies on every deploy, scale-in, and task stop. The default is 300 seconds, which is excessive when most API requests finish in seconds. Shortening it to 30–60 seconds, in line with stopTimeout, is realistic.
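The container-side settings these two values are meant to line up with live in the task definition. The following is a minimal sketch under assumptions: the image ships `curl`, the app serves `/healthz` on port 8080 as in the target group above, `var.app_image` and `var.execution_role_arn` are hypothetical variables, and the numbers are illustrative rather than measured.

```hcl
# Variation of a Fargate task definition showing stopTimeout and startPeriod.
resource "aws_ecs_task_definition" "app_with_timeouts" {
  family                   = "web-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn # hypothetical variable

  container_definitions = jsonencode([
    {
      name         = "app"
      image        = var.app_image # hypothetical variable
      essential    = true
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]

      # Seconds between SIGTERM and SIGKILL when the task is stopped.
      # Fargate accepts up to 120.
      stopTimeout = 60

      healthCheck = {
        command  = ["CMD-SHELL", "curl -fsS http://localhost:8080/healthz || exit 1"]
        interval = 15
        timeout  = 5
        retries  = 3
        # Failures during the first 30 seconds do not count toward retries;
        # pair this with health_check_grace_period_seconds on the service.
        startPeriod = 30
      }
    }
  ])
}
```

If initialization regularly takes longer than 30 seconds, raise `startPeriod` and the service's grace period together rather than only one of them.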

Security-group chaining: a design that doesn't open 0.0.0.0/0

The most important security design in Fargate networking is security-group chaining.

[Internet]
      │ TCP 443
      ▼
[alb-sg]  ── egress: only task-sg:8080
      │ TCP 8080
      ▼
[task-sg] ── ingress: allow only from alb-sg
      │
[task ENI:8080]

The point is to use SG references (referenced_security_group_id) rather than CIDR ranges. A CIDR-based allow rule leaves gaps when IPs change, whereas an SG reference dynamically allows traffic from any resource with that SG attached, so it automatically follows the ALB as it scales out and gains IPs.

You must not open 0.0.0.0/0 on the task's SG inbound. Do so, and even though you're in a private subnet, other resources within the VPC can access it directly.

The same principle applies when there are multiple services communicating with each other. Chain "service A's SG → service B's SG (app port only)." Using Service Connect described later, this SG design can be further organized.

Subnet design: private + NAT vs public direct placement

Private subnet + NAT Gateway (the production standard)

The production standard is the pattern of placing tasks in a private subnet and routing outbound traffic via a NAT Gateway.

Private subnet
  └── ECS task (assign_public_ip = false)
        └── → NAT Gateway (public subnet)
              └── → Internet (ECR / CloudWatch / Secrets Manager)

The benefit is minimizing the attack surface. Because the task has no public IP, it can't be reached directly from the internet. Only requests via the ALB arrive.
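The routing behind this pattern is not shown in the article. As a minimal sketch, one NAT Gateway per AZ with a matching private route table could look like the following. The two map variables (AZ name to subnet ID) are hypothetical, and `domain = "vpc"` assumes AWS provider v5 or later.

```hcl
# Hypothetical inputs: one public and one private subnet per AZ.
variable "public_subnet_ids_by_az" {
  type = map(string) # e.g. { "ap-northeast-1a" = "subnet-...", ... }
}

variable "private_subnet_ids_by_az" {
  type = map(string)
}

resource "aws_eip" "nat" {
  for_each = var.public_subnet_ids_by_az
  domain   = "vpc"
}

# One NAT Gateway per AZ, placed in that AZ's public subnet.
resource "aws_nat_gateway" "this" {
  for_each      = var.public_subnet_ids_by_az
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value
}

resource "aws_route_table" "private" {
  for_each = var.private_subnet_ids_by_az
  vpc_id   = var.vpc_id
}

# Default route of each private subnet points at the NAT in the same AZ.
resource "aws_route" "private_default" {
  for_each               = var.private_subnet_ids_by_az
  route_table_id         = aws_route_table.private[each.key].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.this[each.key].id
}

resource "aws_route_table_association" "private" {
  for_each       = var.private_subnet_ids_by_az
  subnet_id      = each.value
  route_table_id = aws_route_table.private[each.key].id
}
```

Keeping the NAT and its route table AZ-local means the loss of one AZ does not cut outbound traffic for tasks in the other.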

Public direct placement (`assign_public_ip = true`)

In dev environments or prototypes, you can also place tasks in a public subnet with assign_public_ip = true (assignPublicIp = ENABLED in the ECS API). Because the task gets a public IP, it can reach ECR and CloudWatch without a NAT, so you avoid the NAT Gateway cost (hourly charge plus data-processing fee).

But it's not recommended for production. The task's ENI is directly addressable from the internet, so a single SG misconfiguration exposes it. The public IP is also only useful in a subnet whose route table points at an internet gateway, so tasks end up sharing public subnets with the ALB and NAT, which muddies the VPC design.

The decision guideline: for production, use private + NAT. If the NAT cost worries you, reduce it with the VPC endpoints described later.

VPC endpoints: reduce the NAT, or isolate

When a Fargate task in a private subnet pulls an image from ECR or sends logs to CloudWatch, by default it communicates with public endpoints via the NAT Gateway. Set up VPC endpoints and you can close this communication within the VPC.

The list of VPC endpoints Fargate needs

Endpoint	Type	Use	Necessity
`com.amazonaws.<region>.ecr.api`	Interface	ECR API (fetching image metadata)	Required when using ECR
`com.amazonaws.<region>.ecr.dkr`	Interface	Pulling image layers from ECR (Docker Registry API)	Required when using ECR
`com.amazonaws.<region>.s3`	Gateway	ECR image layers are stored in S3. ecr.dkr alone is insufficient	Required when using ECR
`com.amazonaws.<region>.logs`	Interface	Sending to CloudWatch Logs via the `awslogs` driver	Required when using CloudWatch Logs
`com.amazonaws.<region>.secretsmanager`	Interface	Secrets Manager injection via `secrets[].valueFrom`	Required when injecting secrets
`com.amazonaws.<region>.ssmmessages`	Interface	ECS Exec (via SSM Session Manager)	Required when ECS Exec is enabled

Caveat: if you set up only the ECR endpoints (ecr.api, ecr.dkr) and omit the S3 gateway endpoint, image-layer pulls go via the NAT. Because ECR image layers are stored in S3, the S3 gateway is required as a set with the ECR endpoints.

The Fargate task's own ECS control-plane communication (ecs, ecs-agent, ecs-telemetry) works without endpoints, but ECR, Secrets Manager, and CloudWatch Logs traffic flows to public endpoints unless you set them up.

Terraform: the full set of VPC endpoints

# ── Security group for the Interface endpoints ────────────────────────────

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

# Accept HTTPS only from inside the VPC (narrow to the private subnet CIDRs for a tighter rule)
resource "aws_vpc_security_group_ingress_rule" "endpoint_https" {
  security_group_id = aws_security_group.vpc_endpoints.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv4         = var.vpc_cidr
}

resource "aws_vpc_security_group_egress_rule" "endpoint_egress" {
  security_group_id = aws_security_group.vpc_endpoints.id
  ip_protocol       = "-1"
  cidr_ipv4         = "0.0.0.0/0"
}

# ── S3 gateway endpoint (required for ECR layer pulls) ────────────────────

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# ── Interface endpoints ───────────────────────────────────────────────────

locals {
  interface_endpoints = [
    "ecr.api",
    "ecr.dkr",
    "logs",
    "secretsmanager",
    "ssmmessages",
  ]
}

resource "aws_vpc_endpoint" "interface" {
  for_each = toset(local.interface_endpoints)

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true   # Keep DNS resolution inside the VPC
}

private_dns_enabled = true is important. This resolves public DNS names like ecr.ap-northeast-1.amazonaws.com to the in-VPC ENI IPs, so you can isolate without changing the app or task definition at all.

Does the NAT Gateway become unnecessary?

Even setting up all the VPC endpoints, you can't necessarily abolish the NAT Gateway completely. If the task calls non-AWS external APIs (Stripe, SendGrid, etc.), you still need the NAT. If your use case is "purely AWS services only," a NAT-less isolated configuration is possible, but it's a rare case.

The realistic decision: the production standard is to keep the NAT Gateway while dropping ECR, CloudWatch, and Secrets Manager traffic internally with VPC endpoints, reducing the NAT's data-processing cost and bandwidth dependence.
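Once the endpoints are in place, the task's outbound rule can be tightened as the comment on the task egress rule suggests. The sketch below is an assumption about how that might look, replacing the fully open `task_egress` rule rather than sitting next to it; the resource name is hypothetical.

```hcl
# Replacement for the fully open task egress rule: HTTPS only.
# Covers the Interface endpoints, the S3 gateway endpoint, and
# HTTPS calls to external APIs through the NAT Gateway.
resource "aws_vpc_security_group_egress_rule" "task_https_only" {
  security_group_id = aws_security_group.task.id
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
  cidr_ipv4         = "0.0.0.0/0"
}

# Anything else the task talks to (a database port, another service's
# app port) needs its own explicit egress rule once the open rule is gone.
```

The trade-off is operational: every new dependency on a non-443 port now requires a rule change, which is usually the point of tightening egress.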

Service-to-service communication: ECS Service Connect (recommended)

In a microservices configuration, you need internal communication where service A calls service B. There are several choices.

Method	Mechanism	Pros	Cons
Internal ALB	Stand up an internal ALB for each service	Can use the ALB's routing features	More resources, cost, management complexity
Cloud Map (DNS)	DNS resolution with a Route 53 private hosted zone	Simple	No retries, timeouts, or metrics
Service Connect	Inject Envoy as a sidecar, resolve by logical name	DNS + automatic retries + metrics. ECS-managed	Only reaches ECS services registered in the same Service Connect namespace

Service Connect is recommended. It achieves loose coupling without adding internal ALBs, and Envoy automatically collects connection metrics and publishes them to CloudWatch under the AWS/ECS namespace, which also improves observability.

How Service Connect works

"Amazon ECS Service Connect provides management of service-to-service communication as Amazon ECS configuration. It builds both service discovery and a service mesh in Amazon ECS." (Source: ECS Service Connect)

Service Connect has ECS automatically inject an Envoy proxy as a sidecar (you don't need to explicitly add it to the container definition). The client-side task can call by logical name (http://order-api:8080/orders), and Envoy handles name resolution, load balancing, retries, and timeouts.

Configuration in Terraform + task definition

Service Connect is configured in the ECS service's service_connect_configuration block.

# ── Service A (order-api) publishes itself through Service Connect ────────

resource "aws_ecs_service" "order_api" {
  name             = "order-api"
  cluster          = var.cluster_id
  task_definition  = var.order_api_task_definition_arn
  desired_count    = 2
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.order_task.id]   # per-service SG, so it can be chained from callers
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = var.cloud_map_namespace_arn   # Created beforehand with aws_service_discovery_http_namespace

    # The endpoint this service publishes
    service {
      port_name      = "http"            # Must match portMappings[].name in the task definition
      discovery_name = "order-api"       # Logical name registered in the namespace
      client_alias {
        port     = 8080
        dns_name = "order-api"           # Other tasks reach the service by this name
      }
    }
  }
}

# ── Service B (inventory-api) is the caller of order-api ──────────────────

resource "aws_ecs_service" "inventory_api" {
  name             = "inventory-api"
  cluster          = var.cluster_id
  task_definition  = var.inventory_api_task_definition_arn
  desired_count    = 2
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.inventory_task.id]   # per-service SG, referenced by order-api's ingress rule
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = var.cloud_map_namespace_arn

    # A client-only service omits the service block (it publishes nothing)
    # The auto-injected Envoy proxies traffic to order-api:8080
  }
}
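The namespace passed in above as `var.cloud_map_namespace_arn` has to exist before either service is created. A minimal sketch of creating it, with hypothetical names (`prod-internal`, `prod`), and optionally making it the cluster default:

```hcl
# HTTP namespace used by Service Connect (name is illustrative).
resource "aws_service_discovery_http_namespace" "internal" {
  name        = "prod-internal"
  description = "Namespace shared by Service Connect services"
}

# Optional: set the namespace as the cluster-wide default so new
# services pick it up without repeating it in every service block.
resource "aws_ecs_cluster" "main" {
  name = "prod"

  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.internal.arn
  }
}
```

In that case the services would reference `aws_service_discovery_http_namespace.internal.arn` directly instead of a variable.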

On the task-definition side, you need to give portMappings a name.

{
  "containerDefinitions": [
    {
      "name": "app",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp",
          "name": "http",
          "appProtocol": "http"
        }
      ]
    }
  ]
}

With this, when you throw an HTTP request to http://order-api:8080/orders from inside inventory-api, Envoy acts as an interceptor and automatically handles connection pooling, retries, and timeouts.

Organize service-to-service SGs with Service Connect

Even when using Service Connect, SG design is still needed. Service Connect does not bypass security groups: the Envoy proxy in the calling task still opens a connection to the receiving task's ENI. So allow the relevant port from the sending task's SG to the receiving task's SG.

# Allow traffic from inventory-api to order-api
resource "aws_vpc_security_group_ingress_rule" "order_from_inventory" {
  security_group_id            = aws_security_group.order_task.id
  referenced_security_group_id = aws_security_group.inventory_task.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}
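The rule above references two per-service security groups that this article does not define. A minimal sketch of them, following the same conventions as the ALB and task SGs earlier (the name prefixes and the egress rule name are hypothetical):

```hcl
resource "aws_security_group" "order_task" {
  name_prefix = "prod-order-task-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

resource "aws_security_group" "inventory_task" {
  name_prefix = "prod-inventory-task-"
  vpc_id      = var.vpc_id
  lifecycle { create_before_destroy = true }
}

# Caller-side counterpart of the ingress rule: inventory-api may open
# connections to order-api on the app port only.
resource "aws_vpc_security_group_egress_rule" "inventory_to_order" {
  security_group_id            = aws_security_group.inventory_task.id
  referenced_security_group_id = aws_security_group.order_task.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# Both SGs also need an egress rule for HTTPS (443) so the tasks can
# still pull images and reach AWS APIs; it is omitted here for brevity.
```

Splitting the SGs per service is what makes the chain expressive: with a single shared task SG, "A may call B" cannot be distinguished from "B may call A".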

When to use Cloud Map instead

Cloud Map is simple DNS-based service discovery using a Route 53 private hosted zone. ECS registers and deregisters A records as the service's tasks start and stop.

If simple DNS resolution alone is enough with no need for retries, timeouts, or metrics, Cloud Map suffices
If you want connection observability, circuit breakers, or fine-grained timeout control, choose Service Connect

For a new build, I recommend Service Connect. It has more configuration than Cloud Map, but Envoy's metrics (connection count, error rate, latency) are directly usable for production observation.
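For comparison, the Cloud Map route could be sketched as follows. This is an illustration under assumptions: the namespace name `internal.local`, the TTL, and the resource names are hypothetical, and the service reuses variables from the Service Connect example.

```hcl
# Private DNS namespace backed by a Route 53 private hosted zone.
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "internal.local"
  vpc  = var.vpc_id
}

# Discovery service: one A record per running task.
resource "aws_service_discovery_service" "order_api" {
  name = "order-api"

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.internal.id
    routing_policy = "MULTIVALUE"

    dns_records {
      type = "A"
      ttl  = 10
    }
  }
}

resource "aws_ecs_service" "order_api_cloud_map" {
  name            = "order-api"
  cluster         = var.cluster_id
  task_definition = var.order_api_task_definition_arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.task.id]
    assign_public_ip = false
  }

  # ECS keeps the A records in sync with the task ENI addresses.
  service_registries {
    registry_arn = aws_service_discovery_service.order_api.arn
  }
}
```

Callers would then resolve `order-api.internal.local` themselves, which is why retries, timeouts, and metrics become the client's responsibility in this model.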

Caveats when using an NLB

When you need L4 communication (gRPC, WebSocket over TCP, UDP) or a static IP, use an NLB. Let me summarize the Fargate + NLB-specific caveats.

resource "aws_lb" "internal" {
  name               = "prod-nlb"
  internal           = true
  load_balancer_type = "network"
  subnets            = var.private_subnet_ids
}

resource "aws_lb_target_group" "nlb_app" {
  name        = "nlb-app"
  port        = 8080
  protocol    = "TCP"
  vpc_id      = var.vpc_id
  target_type = "ip"   # Fargate requires ip targets with an NLB as well

  deregistration_delay = 30

  health_check {
    protocol            = "HTTP"
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 2
    interval            = 10
  }
}
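The two resources above still need a listener to connect them. A minimal sketch, with a hypothetical resource name and the listener port assumed to match the target port:

```hcl
# TCP listener that forwards NLB traffic to the Fargate target group.
resource "aws_lb_listener" "nlb_app" {
  load_balancer_arn = aws_lb.internal.arn
  port              = 8080
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.nlb_app.arn
  }
}
```

The ECS service then attaches to `aws_lb_target_group.nlb_app` through its `load_balancer` block exactly as it does for an ALB target group.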

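When using UDP, Fargate platform version 1.4.0 or later is required (on Linux, LATEST currently resolves to 1.4.0). Earlier platform versions don't support UDP.

Also, the NLB above has no security group of its own, so the task's SG can't reference it the way it references the ALB's SG; the allow rule has to be CIDR-based. For ip targets over TCP, client IP preservation is off by default, so the task sees the NLB's private IPs as the source, and allowing the CIDR of the NLB's subnets is enough. If you turn client IP preservation on, the task sees the original client addresses and you must allow those ranges instead. (An NLB created with security groups attached can be referenced by SG like an ALB.)

# Allow the NLB's source addresses by CIDR (no SG reference is available here)
resource "aws_vpc_security_group_ingress_rule" "task_from_nlb_cidr" {
  security_group_id = aws_security_group.task.id
  ip_protocol       = "tcp"
  from_port         = 8080
  to_port           = 8080
  cidr_ipv4         = var.vpc_cidr   # Allows the whole VPC; narrow to the NLB subnets' CIDRs where possible
}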

To filter traffic with a WAF, you need a combination with an ALB. For details, see the WAF defense-in-depth guide.

Design checklist

Items to always confirm before releasing to production.

awsvpc / ALB basics

Is the task definition's networkMode awsvpc (fixed for Fargate, but confirm explicitly)
Is the ALB/NLB target group's target_type = "ip"
Do the ALB and Fargate task have separate SGs, chained with SG references
Is there no 0.0.0.0/0 on the task's SG inbound
Is health_check_grace_period_seconds set, accounting for the app's initialization time
Is deregistration_delay shortened to align with stopTimeout (the default 300 seconds is excessive)

Subnet / routing

Are tasks placed in a private subnet with assign_public_ip = false
Does a NAT Gateway exist in each AZ (a single-AZ NAT is a SPOF)
Does the private route table have a route to the NAT Gateway

VPC endpoints

If using ECR, are the 3-piece set of ecr.api, ecr.dkr, and the S3 gateway in place
If using CloudWatch Logs, is there a logs endpoint
If using secrets[].valueFrom, is there a secretsmanager endpoint
If enabling ECS Exec, is there an ssmmessages endpoint
Does the Interface endpoints' SG allow TCP 443 from the private subnet CIDR
Is private_dns_enabled = true

Service Connect / service-to-service

If there are multiple services, are they resolved with Service Connect or Cloud Map without adding internal ALBs
If using Service Connect, is a Cloud Map namespace (HTTP namespace) created
Are name and appProtocol set on the task definition's portMappings
Is mutual communication allowed between services' SGs (chained with SG references)

Summary

Fargate's networking unfolds entirely from the single point of awsvpc.

awsvpc + target_type=ip is the absolute rule to grasp first
With SG chaining (ALB SG → task SG), never open 0.0.0.0/0 to the task
Private subnet + NAT is the safe side for production. Isolate communication to major AWS services with VPC endpoints to cut cost and attack surface
With Service Connect, make service-to-service communication loosely coupled and prevent the proliferation of internal ALBs

The reason I can stably operate a 221-endpoint lumber-distribution SaaS with API Gateway → NLB → ALB → ECS on Fargate is that I'm thorough with this design at each layer. Build the networking layer correctly, and you can clearly separate app-layer problems from infra-layer problems, and the speed of troubleshooting goes up too.

For cost optimization (Spot, Graviton, Savings Plans), see the ECS on Fargate cost-optimization guide; for investigating the cause when a task stops, the ECS on Fargate troubleshooting guide. The full configuration of this award-winning SaaS on this portfolio is introduced in detail in the lumber-industry-dx case study. If you'd like to move forward together on designing and building a Fargate production foundation, please reach out from there.
