# You Can Halve Your Server Bill with 'Design': A Terraform × FinOps Practical Guide to Cutting a Startup's AWS Monthly Bill by 30–50%

> "The AWS bill is up again this month, too." Infrastructure costs ballooning faster than MRR growth is a design problem, not a technical one. For startup CEOs/COOs, we explain — with analogies even non-engineers understand and numeric simulations — a Terraform implementation that combines autoscaling, Fargate Spot, S3 tiering, auto-stopping non-production, and budget alerts to build a foundation that cuts the monthly bill by 30–50% while withstanding business growth.

- Published: 2026-04-18
- Author: 友田 陽大
- Tags: AWS, Terraform, Infrastructure, Cost Optimization, FinOps, Startups, IaC, Management
- URL: https://tomodahinata.com/en/blog/aws-terraform-startup-cost-optimization-finops
- Category: Infrastructure, IaC & CI/CD
- Pillar guide: https://tomodahinata.com/en/blog/aws-ecs-vs-eks-startup-decision-framework

## Key points

- 30–50% of the AWS monthly bill is "sleeping money," and the cause is design, not technology
- Autoscaling lowers off-peak resource costs by 50–70% and absorbs sudden traffic spikes automatically
- Fargate Spot keeps the baseline on regular pricing and runs only the added seats on Spot, discounting those seats by up to 70% without stopping the service
- S3 Intelligent-Tiering automatically demotes old data to cheaper tiers; also always set a retention period on CloudWatch Logs
- AWS Budgets alerts, plus automatically stopping non-production environments at night and on weekends, structurally prevent surprise bills caused by forgetting to turn things off

---

## Introduction: "The AWS bill is high again this month" — that feeling is correct

Nearly every year, startup CEOs and COOs bring me the same question.

> "The engineers tell me 'it can't be helped,' but even in months where revenue hasn't grown that much, the AWS cost is the same or even higher. Is this normal?"

Let me start with the conclusion. **It's not normal.** In most cases, **30–50% of the monthly bill is "sleeping money."** This article lays out a realistic methodology for waking that sleeping money and putting it back into the business. Every technical term comes with an analogy from restaurant management, warehouse operation, or household budgeting, so you can read to the end without understanding the technical details.

### The "5 infrastructure wastes" common to startups

In many SaaS and web services, the following "wastes" quietly burn money every day.

1. **Full capacity in the dead of night**: running servers fully staffed even when there are no customers
2. **A large venue sized for the peak**: keeping the seat count for the Friday-night crowd even on weekday afternoons
3. **The never-opened closet**: old data and logs that have never been retrieved occupy an expensive shelf
4. **A staging environment as large as production**: renting an internal-test "fitting room" the same size as the main store
5. **A line you meant to cancel is still billed**: unused IP addresses and load balancers are quietly billed every month

At startups, these wastes commonly add up to **80,000 to 400,000 yen per month.** Annualized, that is equivalent to **1.5–2 months of an engineer's salary** or **half a year's ad budget**: money that could have gone into growing the business.

### The solution this article presents

Implement the following 4 pillars declaratively with **Terraform (a tool that describes infrastructure as a blueprint)** to structurally stop both "**accidental bill increases caused by misconfiguration**" and "**waste that grows because people forget.**"

- **Pillar ①**: Autoscaling = automatically adjust the seat count according to the number of customers
- **Pillar ②**: Fargate Spot = use empty-seat tickets to discount the seat fee by up to 70%
- **Pillar ③**: S3 Intelligent-Tiering = automatically swap the warehouse by usage frequency
- **Pillar ④**: Budgets + auto-stopping non-production = automating the household budget and preventing forgotten shutdowns

---

## Body ①: Pillar ① — Autoscaling (move the seat count according to the number of customers)

Take a restaurant as an example: 40 seats at lunchtime, 4 seats right before closing late at night. A human manager adjusts the seat count this way as a matter of course, but in conventional server operation, **renting 40 seats around the clock** was the norm. The 36 empty seats late at night are the same as **paying electricity and rent every day for seats no one sits in.**

AWS autoscaling automates this seat-count adjustment. **It adds seats when customers (request count, CPU load) increase and removes them when they decrease.** Late at night it shrinks to the minimum seat count, and when traffic jumps 10× after an event announcement, it adds seats automatically.

### Three effects that matter from a business viewpoint

- **Less wasted fixed cost**: off-peak resource costs drop by 50–70%
- **Resilience to traffic spikes**: the service stays up through TV exposure or a social media buzz, so the opportunity isn't lost
- **No one gets woken up at night**: engineers aren't dragged out of bed for manual scale-up work (**a foundation that makes the people you hire less likely to quit**)

### Implementation in Terraform (details in the later code section)

The declaration is almost a single sentence: "keep average CPU utilization around 60%; add seats when it rises above that and remove them when it falls below." The value is that **this decision keeps running 24 hours a day with no human judgment involved.**

---

## Body ②: Pillar ② — Fargate Spot (discount the seat fee by up to 70% with empty-seat tickets)

Imagine an airline's standby seat. A seat still unsold at the last minute is offered at **30% of the list price**, and a traveler with a flexible schedule flies at a bargain. AWS Fargate Spot is the same idea: a service that **lends out AWS's spare compute capacity at up to a 70% discount.**

"Wait, can you get kicked out partway?" Yes, occasionally. That is exactly why one **design rule matters.**

**Don't make all the seats Spot.** This is the iron rule. Put the seats you always need (the baseline) on regular pricing and run only the seats added for peaks on Spot: a **hybrid seating plan.** For example, with "at least 2 seats regular, the added portion Spot," the service keeps running even if Spot seats are reclaimed, and the cost still drops substantially.

### The expected reduction effect

- When you switch 50% of always-on to Spot: **about 25–35% of the overall infrastructure cost is reduced**
- Batch processing and nightly machine-learning jobs can often be **100% Spot** (if they stop, just re-run by morning)
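
As a rough illustration of the "100% Spot" batch case, the sketch below runs a worker service entirely on Fargate Spot. The service name `batch-worker` and the task definition `aws_ecs_task_definition.batch` are hypothetical and not part of this article's configuration; the cluster, subnets, and security group reuse the names from the code section later in the article.

```hcl
# Hypothetical batch worker: every task runs on Fargate Spot.
# Assumes aws_ecs_task_definition.batch is defined elsewhere (an assumption)
# and that the cluster has the FARGATE_SPOT capacity provider associated.
resource "aws_ecs_service" "batch_worker" {
  name            = "batch-worker"
  cluster         = aws_ecs_cluster.app.id
  task_definition = aws_ecs_task_definition.batch.arn
  desired_count   = 1

  # No regular-price baseline: if a task is reclaimed, ECS starts a replacement
  # when Spot capacity is available, and the job is simply re-run.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.app.id]
  }
}
```

This only suits jobs that can be interrupted and restarted safely. Fargate Spot sends a two-minute warning before reclaiming a task, so the job should handle the stop signal and either checkpoint or be safe to re-run from the start.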

---

## Body ③: Pillar ③ — S3 Intelligent-Tiering (swap the warehouse by usage frequency)

Is there a cardboard box somewhere in your home that has sat in storage, unopened, for 3 years? Corporate data is the same. **It's read daily right after creation, but no one looks at it 3 months later.** Yet by default in AWS, **new and old data alike stay in the same premium warehouse.**

S3 Intelligent-Tiering is a mechanism that **automatically observes how often each piece of data is used and, when access drops off, moves it to a cheaper warehouse on its own.** No human needs to make the call: **you configure it once**, and objects stored in the Intelligent-Tiering storage class are moved automatically.

### Storage cost by tier (guideline)

| Storage tier | Storage price (USD per GB-month) | Characteristic |
|---|---|---|
| Frequent-access tier (equivalent to normal S3) | About $0.023 | Immediate retrieval anytime |
| Infrequent-access tier (no access for 30 days) | About $0.0125 | Immediate retrieval, storage fee half |
| Archive tier (no access for 90 days) | About $0.004 | Retrieval in a few hours, **storage fee 1/5 or less** |
| Deep-archive tier (180 days) | About $0.00099 | Retrieval in 12 hours, **storage fee 1/20 or less** |

**Many old images, logs, and backups automatically move down to a cheaper warehouse and are retrieved only when needed.** For 1 TB of logs, the estimated storage cost drops from **about $250 to around $50 per year.**
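
The estimate can be sanity-checked with a few lines of Terraform arithmetic. This is a back-of-the-envelope sketch that uses the guideline prices from the table and assumes all 1 TB eventually sits in the archive tier; the local and output names are illustrative.

```hcl
# Back-of-the-envelope storage cost check using the guideline prices above.
# All values are illustrative assumptions; actual prices vary by region.
locals {
  stored_gb      = 1000  # 1 TB, rounded to 1,000 GB for simplicity
  frequent_price = 0.023 # USD per GB-month, frequent-access tier
  archive_price  = 0.004 # USD per GB-month, archive tier

  yearly_frequent_usd = local.stored_gb * local.frequent_price * 12
  yearly_archive_usd  = local.stored_gb * local.archive_price * 12
}

output "yearly_storage_cost_usd" {
  value = {
    all_frequent = local.yearly_frequent_usd
    all_archive  = local.yearly_archive_usd
  }
}
```

This works out to roughly $276 per year if everything stays in the frequent-access tier versus roughly $48 per year in the archive tier, the same order as the rounded figures above. It ignores request, retrieval, and monitoring charges, so treat it as an upper bound on the saving.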

---

## Body ④: Pillar ④ — Budgets + auto-stopping non-production (the household budget and preventing forgotten shutdowns)

No matter how much you optimize, **a misconfiguration or a human forgetting to turn something off** can still produce a surprise 50,000-yen charge at month-end. The following 2 mechanisms prevent this.

### AWS Budgets: the household budget that notifies before going over budget

"This month, **notify Slack when spending reaches 80% of the budget.** Email the CEO at 90%. Emergency response at 100%." Declare this in Terraform, and **no human needs to keep watch.**

### Auto-stopping non-production environments: turn off the lights at night and on holidays

Development and staging environments are **mostly left running late at night and on weekends, when no one is logged in.** Simply switching them to "running only on weekdays 9:00–19:00, stopped at night and on weekends" **cuts the non-production infrastructure cost by about 2/3.** The arithmetic: 10 hours × 5 days is 50 of the 168 hours in a week, or about 30%.

Declare this in Terraform too, using EventBridge Scheduler, and **the lights reliably go off even if someone forgets.**

---

## The technical backing: the Terraform code that declares these

The following is actual Terraform code. It is written to a quality bar engineers can rely on, and **each block states in its comments which business risk it prevents.** Decision-makers, please **skim just the comments.**

### ECS Fargate + autoscaling + Spot hybrid

```hcl
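# ============================================================
# Business goals:
#   (1) Stay up through peak traffic surges (zero opportunity loss)
#   (2) Shrink the seat count off-peak to minimize cost
#   (3) Keep at least 2 seats on regular pricing to remove service-stop risk
#   (4) Run added seats on Spot at up to 70% off, flattening the cost curve as you grow
# ============================================================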

resource "aws_ecs_cluster" "app" {
  name = "app-prod"

  # Business goal: shorten incident root-cause analysis (customer incident-report prep from 1 day to 1 hour)
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

# Capacity provider allocation. This is the economic core.
resource "aws_ecs_cluster_capacity_providers" "app" {
  cluster_name       = aws_ecs_cluster.app.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  # Regular : Spot = 1 : 1 (that is, half of the added seats run on Spot)
  # Business goal: lower the average seat price while staying up even when Spot seats are reclaimed
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2 # At least 2 seats always run on regular Fargate (removes service-stop risk)
    weight            = 1
  }
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    base              = 0
    weight            = 1
  }
}

# ECS service (the store that actually runs)
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.app.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2

  # Business goal: automatically roll back failed deployments
  #   so that, after a bad release has already affected customers,
  #   no engineer has to revert it by hand in the middle of the night.
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Start seats for the new version before folding the old ones: zero downtime
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200

  network_configuration {
    subnets          = var.private_subnet_ids # Multi-AZ removes the single point of failure
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false # No direct attack surface from the internet
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  # Set the capacity providers explicitly (overrides the cluster default)
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 2
    weight            = 1
  }
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    base              = 0
    weight            = 1
  }

  # Business goal: keep Terraform from overwriting the seat count so autoscaling decisions are respected
  lifecycle {
    ignore_changes = [desired_count]
  }
}

# ----------- Autoscaling policy (automatic seat-count adjustment) -----------
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.app.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2  # Minimum seats: 2 even late at night and early in the morning
  max_capacity       = 20 # Maximum seats: withstands 10× traffic
}

# Scale with a target of 60% CPU utilization
# Business goals:
#   - Add seats so that CPU doesn't stay above 60% (customers aren't kept waiting)
#   - Remove seats when CPU falls well below 60% (no overpaying on rent)
resource "aws_appautoscaling_policy" "cpu" {
  name               = "app-cpu-target"
  service_namespace  = aws_appautoscaling_target.app.service_namespace
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  policy_type        = "TargetTrackingScaling"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60.0

    # Scale out (add seats) quickly; scale in (remove seats) cautiously
    # Business goal: never lag behind a sudden peak, and never regret removing seats on a momentary dip
    scale_out_cooldown = 60
    scale_in_cooldown  = 300
  }
}
```
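
The service above reads its subnets from `var.private_subnet_ids`, which the article does not declare. A minimal sketch of that variable might look like the following; the description text and the validation rule are assumptions added for illustration.

```hcl
# Sketch of the variable referenced by the ECS service above.
# The validation only checks the count; it cannot verify that the subnets
# actually sit in different Availability Zones.
variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnet IDs for the ECS tasks, ideally spread across Availability Zones."

  validation {
    condition     = length(var.private_subnet_ids) >= 2
    error_message = "Provide at least two private subnets so tasks can be spread across Availability Zones."
  }
}
```

Failing fast at plan time when only one subnet is passed keeps the "Multi-AZ" intent in the comments from silently eroding in a copied environment.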

### S3 Intelligent-Tiering (automatic warehouse swapping)

```hcl
# ============================================================
# Business goals:
#   Move old logs, images, and backups to a cheaper warehouse with no human judgment
#   A saving of roughly $200–$300 per year is realistic at the 1 TB scale
# ============================================================
resource "aws_s3_bucket" "assets" {
  bucket = "my-startup-assets"
}

# Versioning: a recovery path for accidental deletion (prevents data loss and the loss of trust that follows)
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id
  versioning_configuration { status = "Enabled" }
}

# Intelligent-Tiering configuration: automatic demotion by access frequency
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  name   = "assets-intelligent-tiering"
  status = "Enabled"

  # No access for 90 days: archive tier (storage cost about 1/5)
  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  # No access for 180 days: deep-archive tier (storage cost about 1/20)
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

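# Periodic cleanup of noncurrent versions and incomplete multipart uploads
# Business goal: regularly clear the storage cost of the "ghost data" that versioning accumulates
resource "aws_s3_bucket_lifecycle_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    id     = "expire-noncurrent-versions"
    status = "Enabled"
    filter {} # Applies to all objects

    noncurrent_version_expiration { noncurrent_days = 60 }
    abort_incomplete_multipart_upload { days_after_initiation = 7 }
  }

  # Intelligent-Tiering only manages objects stored in that storage class,
  # so move objects into it (this rule is unnecessary if uploads already
  # set the INTELLIGENT_TIERING storage class)
  rule {
    id     = "move-to-intelligent-tiering"
    status = "Enabled"
    filter {} # Applies to all objects

    transition {
      days          = 0
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}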
```
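
The key points also call for a retention period on CloudWatch Logs, which the article's code does not show. By default a log group keeps its data indefinitely, so a minimal sketch could look like this; the log group name `/ecs/app-prod` is a hypothetical example, and 30 days matches the simulation table later in the article.

```hcl
# Hypothetical log group for the production service.
# Without retention_in_days, CloudWatch Logs never expires the data
# and the storage charge grows every month.
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/app-prod"
  retention_in_days = 30 # Logs older than 30 days are deleted automatically
}
```

If some logs must be kept longer for audits, one option is to export them to the S3 bucket above before they expire rather than raising the retention period for everything.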

### Auto-stopping non-production environments (turn off the lights at night and on holidays)

A pair of EventBridge Scheduler schedules runs the development environment's ECS service **only on weekdays 9:00–19:00** and drops it to a desired count of 0 otherwise.

```hcl
# ============================================================
# Business goals:
#   Cut non-production (dev/stg) infrastructure cost by about 2/3
#   Eliminate 50,000-yen-a-month accidents caused by forgetting to turn things off
# ============================================================

# Stop schedule: set desired_count = 0 on weekdays at 19:00 JST
resource "aws_scheduler_schedule" "stop_dev" {
  name       = "stop-dev-service"
  group_name = "default"
  flexible_time_window { mode = "OFF" }

  # cron expression: Mon-Fri 19:00, evaluated in the Asia/Tokyo timezone set below
  # (so the hour is written in JST, not UTC)
  schedule_expression          = "cron(0 19 ? * MON-FRI *)"
  schedule_expression_timezone = "Asia/Tokyo"

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ecs:updateService"
    role_arn = aws_iam_role.scheduler.arn

    input = jsonencode({
      Cluster      = "app-dev"
      Service      = "app"
      DesiredCount = 0
    })
  }
}

# Start schedule: restore desired_count = 2 on weekdays at 9:00 JST
resource "aws_scheduler_schedule" "start_dev" {
  name       = "start-dev-service"
  group_name = "default"
  flexible_time_window { mode = "OFF" }

  schedule_expression          = "cron(0 9 ? * MON-FRI *)"
  schedule_expression_timezone = "Asia/Tokyo"

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ecs:updateService"
    role_arn = aws_iam_role.scheduler.arn

    input = jsonencode({
      Cluster      = "app-dev"
      Service      = "app"
      DesiredCount = 2
    })
  }
}
```
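
Both schedules reference `aws_iam_role.scheduler`, which is not shown in the article. The sketch below is one way it might be defined, assuming the dev cluster is named `app-dev` and the service `app` as in the schedule inputs; the role and policy names are hypothetical.

```hcl
# Trust policy: lets EventBridge Scheduler assume the role.
data "aws_iam_policy_document" "scheduler_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["scheduler.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "scheduler" {
  name               = "dev-ecs-scheduler" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.scheduler_assume.json
}

# Least privilege: the role may only update the dev service, nothing else.
data "aws_iam_policy_document" "scheduler_ecs" {
  statement {
    actions   = ["ecs:UpdateService"]
    resources = ["arn:aws:ecs:*:*:service/app-dev/app"]
  }
}

resource "aws_iam_role_policy" "scheduler_ecs" {
  name   = "update-dev-service" # hypothetical name
  role   = aws_iam_role.scheduler.id
  policy = data.aws_iam_policy_document.scheduler_ecs.json
}
```

Scoping the permission to the single dev service means a mistyped cluster name in a schedule cannot stop production.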

### Budget alerts (automating the household budget)

```hcl
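# ============================================================
# Business goals:
#   Eliminate the "surprise invoice" nobody notices until month-end
#   Respond fast when an accident happens, stopping the damage the same day rather than at month-end
# ============================================================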

resource "aws_budgets_budget" "monthly" {
  name         = "monthly-cost"
  budget_type  = "COST"
  limit_amount = "2000" # USD/月
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Notify Slack at 80% of the budget (noticed while there is still headroom)
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.cost_alert.arn]
  }

  # Email the CEO at 90% (the stage that calls for a management decision)
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.ceo_email]
  }

  # Forecast-based alert: warn mid-month if spending is projected to exceed the budget by month-end
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.ceo_email]
  }
}
```
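
The budget references `aws_sns_topic.cost_alert` and `var.ceo_email`, neither of which is defined in the article. A minimal sketch of both follows, with hypothetical names. It assumes Slack delivery is handled by subscribing a chat integration to the topic, which is not shown here.

```hcl
# Topic that receives the 80% alert.
resource "aws_sns_topic" "cost_alert" {
  name = "cost-alert" # hypothetical name
}

# AWS Budgets can only publish if the topic policy allows its service principal.
data "aws_iam_policy_document" "budgets_publish" {
  statement {
    actions   = ["SNS:Publish"]
    resources = [aws_sns_topic.cost_alert.arn]
    principals {
      type        = "Service"
      identifiers = ["budgets.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "cost_alert" {
  arn    = aws_sns_topic.cost_alert.arn
  policy = data.aws_iam_policy_document.budgets_publish.json
}

variable "ceo_email" {
  type        = string
  description = "Email address that receives the 90% and forecast alerts."
}
```

Without the topic policy the budget is still created, but the 80% notification is never delivered, so it is worth triggering a test alert with a deliberately low budget once after the first apply.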

### The "design philosophy" embedded in the code (a summary for decision-makers)

The technically important point in the code above is that **every setting is declared in code (Terraform).** That has effects tied directly to the business.

- **No dependence on one person**: even if a specific engineer leaves, **the reason this configuration exists remains as a design document (the code)**
- **Reproducible**: you can **replicate the same configuration in 5 minutes** for production, staging, or another business
- **Reviewable changes**: because an infrastructure change becomes a pull request, **management can catch a dangerous setting change before it ships** (also useful for audits)
- **Disaster recovery**: even if an entire region goes down, you can **rebuild in another region from the code**

This is the true value of **IaC (Infrastructure as Code).** In restaurant terms, it's the shift from "an operations manual that exists only in the manager's head" to "an official manual that lets everyone operate to the same standard."

---

## Business impact: "sleeping money" seen in numbers

### Simulation: the case of an MRR 5-million-yen SaaS startup

The following reflects the rough scale of a B2B SaaS with a similar configuration that I actually worked on. The numbers are conservative estimates based on that case.

| Item | Before (the raw configuration) | After (this article's configuration) | Monthly reduction |
|---|---|---|---|
| ECS (production) 24h × 2 large instances | ¥72,000 | ¥45,000 (Autoscale + Spot) | -¥27,000 |
| ECS (dev/stg) 24h running | ¥40,000 | ¥13,000 (night/weekend stop) | -¥27,000 |
| RDS (no reservation) | ¥55,000 | ¥38,000 (1-year RI + size review) | -¥17,000 |
| S3 (all Standard) | ¥18,000 | ¥7,000 (Intelligent-Tiering) | -¥11,000 |
| CloudWatch Logs (no retention period) | ¥12,000 | ¥4,000 (30-day auto-delete) | -¥8,000 |
| Unused LB, NAT Gateway, EIP | ¥15,000 | ¥3,000 (inventory + delete) | -¥12,000 |
| **Total** | **¥212,000** | **¥110,000** | **-¥102,000 / month** |

**Annualized: about ¥1.22 million in savings.** That amount could fund any of the following:

- **1–1.5 months of contract engineering work**
- **An add-on to half a year's web ad budget**
- **A conference booth fee + booth construction fee**
- **The outsourced production cost of onboarding videos**

### More important is the curve "when you scale 10×"

But the real value isn't "this month's bill." **When MRR grows 10×, does the infrastructure cost also grow 10×, or does it stay at around 3× of today's bill?** The answer decides the business's gross-margin structure.

| Growth scenario | The raw configuration (linear increase) | This article's configuration (sublinear increase) | Difference |
|---|---|---|---|
| MRR 5 million yen | ¥212,000 | ¥110,000 | -¥102,000/month |
| MRR 15 million yen (3×) | ¥636,000 | ¥260,000 | -¥376,000/month |
| MRR 50 million yen (10×) | ¥2,120,000 | ¥700,000 | -¥1,420,000/month |

At 10× growth, the monthly difference is **about ¥1.42 million**, or **about ¥17 million a year.** That is enough to **hire one additional senior engineer for a year**, and it is a gross-margin improvement that **feeds directly into the valuation in fundraising.**

**The difference in design decides the future valuation.** This isn't a pep talk. It directly improves the SaaS Rule of 40 (growth rate + profit margin ≥ 40% is the guideline for healthy growth).

### The qualitative effect of reliability: the "invisible revenue" a foundation that doesn't go down produces

It's worth emphasizing that **availability rises at the same time as cost falls.**

- **Far fewer late-night incidents**: with autoscaling and automatic rollback, **the number of times engineers are woken at night drops to 1/5 or less.** The people you hire are more likely to stay.
- **Resilience to traffic spikes**: a foundation that stays up during media exposure is insurance against missing a **breakthrough opportunity that comes only a few times in a company's life.**
- **Audit and security readiness**: because infrastructure changes are recorded in Git, **the change trail** required for SOC2, ISMS, and Privacy Mark is kept automatically.

---

## Conclusion: not "build and done," but leave "a mechanism that keeps dropping"

What I value most in my work is **"leaving a foundation that keeps running next month and beyond, even without me."**

Cutting this month's bill by 50,000 yen with a one-off optimization is relatively easy. But on its own, that saving often reverts to square one within half a year. **Leaving the team the common language of Terraform and making the cost-reduction mechanism itself part of the culture**: that is the real value I want to provide.

### What I promise as a development partner

- **I speak the decision-maker's language**: I always translate technical terms into "business risks and opportunities"
- **I take responsibility with numbers**: a monthly report shows how many yen were saved and what percentage improved before and after each measure
- **I prioritize the long-term view**: I don't adopt a design that saves 50,000 yen in the short term but adds 150,000 yen 3 months later
- **I leave it in a form that can be handed off**: documentation and code that don't depend on one person and that a successor engineer can operate at the same level
- **I work backward from the user experience**: so that cost reduction doesn't sacrifice page speed or availability, I optimize only within the **zone that is safe from a UX standpoint**

### The next step: first, an "inventory"

Before any full-scale work, **an inventory of current AWS usage** (analysis with AWS Cost Explorer, Trusted Advisor, and Compute Optimizer) is enough to reveal **where the room for savings is, within 1–2 weeks.** Starting here makes it possible to produce results from the first month with zero risk.

"Is our configuration optimal?" "Will it withstand next year's scale as-is?" If those questions sound familiar, feel free to get in touch, starting with the initial inventory. Let's create **the first numbers** together that wake the "sleeping money" and put it back into the business.
