友田 陽大
Google Cloud Run in production
GCP
Cloud Run
Cost optimization
Autoscaling
Serverless
Performance
Infrastructure
FinOps

Cloud Run concurrency, autoscaling, billing model, and cost optimization: conquering scale-to-zero and cold starts in real code

Grounded in the official specification, this article explains the three factors that determine Cloud Run cost: concurrency (default 80, max 1000), autoscaling (60% utilization target, scale-to-zero), and the billing model (request-based vs. instance-based). With real gcloud and Terraform code, it covers cold-start countermeasures (min instances, startup CPU boost, gen1/gen2, slim images), break-even estimation, and a cost-optimization checklist.

Reading time: 11 min read
Author: 友田 陽大

One day, the Cloud Run bill suddenly jumps. "It was supposed to fit in the free tier." "A spike came and the instance count grew without limit." "I piled on min instances and ended up with a constant charge." All of these come from not understanding three dials: concurrency, autoscaling, and the billing model. Conversely, once you grasp these three, Cloud Run becomes an extremely cost-efficient platform that is both fast and cheap.

While operating a broadcaster platform on GCP/Cloud Run, I used an asymmetric configuration where the production region was warmed with min instances and the secondary region was scaled to zero for DR, reconciling steady-state cost with disaster-time resilience. Cost is determined not by "endurance" but by "design."

Grounded in the official Cloud Run documentation, this article breaks down the mechanism that determines cost and shows, with real code and estimates, which dial to turn and how much it changes the bill. For the big picture of production operation, see the Cloud Run production-operations guide.

⚠️ About pricing: the unit prices here are a rule of thumb as of June 2026, Tier 1 region, request-based billing. Prices get revised. Always confirm the latest actual amounts in the official Cloud Run pricing. The value of this article is in understanding the structure of "what moves the cost."


The cost formula: what are you paying for

Cloud Run (Services) billing boils down to the sum of three metered items.

  1. CPU time (vCPU-seconds)
  2. Memory time (GiB-seconds)
  3. Request count (per million)

Rule-of-thumb unit prices for Tier 1 region, request-based billing:

Item | Rule-of-thumb unit price (over free tier)
CPU | About $0.000024 / vCPU-second
Memory | About $0.0000025 / GiB-second
Requests | About $0.40 / million requests

Free tier (rule of thumb per month): about 240,000 vCPU-seconds, 450,000 GiB-seconds, 2 million requests. Furthermore, usage is rounded up to the nearest 100 milliseconds.

The decisive point: what is billed is the time an instance is up multiplied by the CPU and memory you reserved. In other words, cost is determined by how many instances were up, for how long, and with how much resource reserved. That makes reducing the instance count (by raising concurrency) and reducing the time instances are up (through scale-to-zero and sensible idle management) the two big levers of cost optimization.
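
The three billed items and the free tier can be put together as a small estimator. This is only a sketch: the constants are the rule-of-thumb figures quoted in this article, not authoritative prices, and the function names `billable_seconds` and `monthly_cost` are made up for illustration.

```python
import math

# Rule-of-thumb unit prices quoted in this article (Tier 1, request-based billing).
CPU_PRICE_PER_VCPU_SECOND = 0.000024
MEMORY_PRICE_PER_GIB_SECOND = 0.0000025
REQUEST_PRICE_PER_MILLION = 0.40

# Monthly free tier as quoted in this article (confirm against the official pricing).
FREE_VCPU_SECONDS = 240_000
FREE_GIB_SECONDS = 450_000
FREE_REQUESTS = 2_000_000


def billable_seconds(duration_ms: float) -> float:
    """Round a billable interval up to the next 100 ms and return it in seconds."""
    return math.ceil(duration_ms / 100) * 100 / 1000


def monthly_cost(vcpu_seconds: float, gib_seconds: float, requests: int) -> float:
    """Sum the three billed items after subtracting the free tier from each."""
    cpu = max(0.0, vcpu_seconds - FREE_VCPU_SECONDS) * CPU_PRICE_PER_VCPU_SECOND
    memory = max(0.0, gib_seconds - FREE_GIB_SECONDS) * MEMORY_PRICE_PER_GIB_SECOND
    reqs = max(0, requests - FREE_REQUESTS) / 1_000_000 * REQUEST_PRICE_PER_MILLION
    return cpu + memory + reqs


# Hypothetical month: 3,000,000 vCPU-seconds, 1,500,000 GiB-seconds, 10 million requests.
print(f"${monthly_cost(3_000_000, 1_500_000, 10_000_000):.2f}")
```

The inputs are billed instance time, not the sum of individual request durations, which is why higher concurrency lowers the vCPU-second and GiB-second figures for the same traffic.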


Dial 1: concurrency — the center of cost

Concurrency is the number of requests a single instance handles simultaneously. The default is 80 and the configurable maximum is 1000.

The maximum concurrency ... is 80. The maximum value is 1000. ... a concurrency of 1 is likely to negatively affect scaling performance, because many instances will have to start up to handle a spike. (— About concurrency)

Let's see why concurrency determines cost in numbers. The case of handling 400 requests simultaneously:

Concurrency | Required instance count (approx.) | Relative cost
1 | 400 | High (instance count = request count)
10 | 40 | Medium
80 (default) | 5 | Low

The higher the concurrency, the fewer instances you need for the same load, which means less billed instance time and a lower bill. The official cost-optimization guide says the same thing plainly.

A higher concurrency setting lets fewer instances handle the same request volume, which can reduce costs. (— Best practices for cost-optimized Cloud Run services)
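
The instance counts for the 400-request case are a ceiling division. A minimal sketch (the function name `instances_needed` is made up for illustration, and it ignores the autoscaler's utilization target):

```python
import math


def instances_needed(concurrent_requests: int, concurrency: int) -> int:
    """Minimum instances so that no instance exceeds its concurrency limit."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return math.ceil(concurrent_requests / concurrency)


# Reproduce the 400-simultaneous-requests case for three concurrency settings.
for concurrency in (1, 10, 80):
    print(concurrency, instances_needed(400, concurrency))
```

This prints 400, 40, and 5 instances for concurrency 1, 10, and 80 respectively.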

This does not hold unconditionally. Raising concurrency pays off when the app can process requests concurrently, which mostly means time spent waiting on I/O such as DB queries and external API calls. For work that saturates the CPU within a single request (image conversion, heavy computation), raising concurrency worsens latency.

Tuning guidelines:

  • I/O-bound (many DB/API waits) → raise concurrency to gain density (consider going from 80 to 100 or more).
  • CPU-bound (heavy computation) → lower concurrency, but avoid 1 (both scaling performance and cost get worse).
  • Concurrency 1 is reasonable only as an exception, when there is a clear reason such as a single request fully occupying the instance's resources or global state that is not safe under concurrent access.
gcloud run deploy api --concurrency 80 --region asia-northeast1 ...

Dial 2: autoscaling — the 60% target and scale-to-zero

Cloud Run automatically adjusts the instance count according to demand.

The autoscaler ... targets ... 60% CPU utilization / 60% concurrency utilization by default. ... revisions scale to zero instances when not receiving traffic ... instances are kept idle for up to 15 minutes. (— About instance autoscaling)

  • Increases/decreases the instance count targeting 60% utilization (both CPU utilization and concurrency utilization).
  • Shrinks to zero if there's no traffic (scale to zero). This is the biggest cost weapon.
  • After processing a request, keeps the instance for up to 15 minutes to reduce cold starts.

Always set min instances and max instances

  • Max instances (default 100) = the cost safety valve. Stops unlimited growth from spikes, loops, or attacks.
  • Min instances = cold-start insurance. Always warm to erase initial latency. But what you warm is billed constantly.
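# --min-instances 1: keep one instance warm at the entry point (removes cold starts); 0 allows full scale-to-zero
# --max-instances 10: cost safety valve; never scale beyond this
gcloud run deploy api --region asia-northeast1 \
  --min-instances 1 \
  --max-instances 10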

You don't need to warm everything. In my project, I warmed only the production-region entry point that users touch directly with min-instances=1, and set the secondary region and internal batch services to min-instances=0. Send the paths where cold starts are tolerable to scale-to-zero and warm only the paths where they are not. This selective warming is the sweet spot between cost and user experience.
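
To get a feel for how the 60% target and the min/max bounds interact, here is a rough sizing sketch. It is a simplification and not the actual autoscaler algorithm: it considers only concurrency utilization, ignores CPU utilization and idle retention, and the function name `estimated_instances` is made up for illustration.

```python
import math


def estimated_instances(
    in_flight_requests: int,
    concurrency: int = 80,
    target_utilization: float = 0.6,
    min_instances: int = 0,
    max_instances: int = 100,
) -> int:
    """Rough instance count: size for the utilization target, then clamp to the bounds."""
    # Requests one instance carries when it sits exactly at the target utilization.
    effective_capacity = concurrency * target_utilization
    wanted = math.ceil(in_flight_requests / effective_capacity)
    return max(min_instances, min(wanted, max_instances))


print(estimated_instances(400))                      # sized for the 60% target
print(estimated_instances(0, min_instances=1))       # warm instance kept at idle
print(estimated_instances(10_000, max_instances=10)) # capped by the safety valve
```

With the defaults, 400 in-flight requests come out at 9 instances instead of the 5 that a pure ceiling division gives, because each instance is sized to run at 60% of its concurrency limit.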


Dial 3: the billing model — request-based vs. instance-based

This dial is the most misunderstood and has the largest cost impact. Cloud Run has two billing modes, determined by how CPU is allocated (Billing settings).

Aspect | Request-based (default) | Instance-based
Flag | --cpu-throttling | --no-cpu-throttling
CPU allocation | Only during request processing | The instance's whole lifecycle
Billing timing | During request processing + startup/shutdown | All the time it's up
Background processing after the response | Can't (CPU is throttled) | Can
Suited traffic | Sporadic, spiky, bursty | Steady, slowly varying

The official guidance is simple.

For services with steady, slowly varying traffic, consider using instance-based billing ... services with sporadic, bursty, or spiky traffic, consider using request-based billing. (— Cost optimization best practices)

Which to choose: three judgments

  1. Traffic is sporadic/spiky → request-based (default). CPU billing stops when idle, so it is far cheaper for intermittent load.
  2. Traffic is steady/high-utilization → consider instance-based. If the service is always processing requests, it can come out ahead because there is no overhead from throttling and restoring CPU.
  3. You want background processing after returning the response (async log shipping, post-processing, cache updates) → instance-based. With request-based billing, CPU is throttled after the response, so post-processing does not run.
# Default (request-based billing): suited to sporadic traffic
gcloud run deploy api --cpu-throttling --region asia-northeast1 ...

# Instance-based billing: steady traffic, or background processing needed after the response
gcloud run deploy worker --no-cpu-throttling --region asia-northeast1 ...

In Terraform, express it with resources.cpu_idle (true = request-based, false = instance-based).

resources {
  limits   = { cpu = "1", memory = "512Mi" }
  cpu_idle = true   # true = request-based billing (default, suits sporadic traffic) / false = instance-based billing (suits steady traffic)
}
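
The three judgments in this section can be written down as a tiny decision helper. This is a sketch of the rule of thumb only: the names `BillingMode` and `choose_billing_mode` are made up for illustration, and real decisions should also weigh measured utilization.

```python
from enum import Enum


class BillingMode(Enum):
    # Each value is the gcloud flag that selects the mode.
    REQUEST_BASED = "--cpu-throttling"
    INSTANCE_BASED = "--no-cpu-throttling"


def choose_billing_mode(steady_traffic: bool, needs_background_work: bool) -> BillingMode:
    """Apply the three judgments: background work first, then traffic shape."""
    if needs_background_work:
        # CPU is throttled after the response under request-based billing.
        return BillingMode.INSTANCE_BASED
    if steady_traffic:
        return BillingMode.INSTANCE_BASED
    return BillingMode.REQUEST_BASED


mode = choose_billing_mode(steady_traffic=False, needs_background_work=False)
print(mode.name, mode.value)
```

Background work is checked first because it is a functional requirement, while traffic shape is only a cost preference.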

Shrink cold starts

The price of scale-to-zero is the cold start. When a service wakes from zero, the first request's latency grows by the container-startup time plus the app-initialization time. The countermeasures come in stages.

1. Warm with min instances (the most reliable)

Set --min-instances to 1 or more and that many instances stay up, with no cold starts for the traffic they absorb. This is reliable, but it is billed constantly. The trick is to use it only on the paths whose latency users actually perceive.

2. Startup CPU boost

--cpu-boost increases CPU only during startup to speed up initialization. The additional cost is small and the effect is large, so treat it as a standard setting.

gcloud run deploy api --cpu-boost --region asia-northeast1 ...

3. Choose the execution environment (gen1 cold-starts faster)

gen1 (gVisor) cold-starts faster and suits lightweight APIs. If you need full Linux compatibility or NFS/VPC performance, gen2. If cold start is the top priority, specify gen1.

4. Make the image smaller, delay initialization

  • A slim base image (distroless, alpine, multi-stage build) for fast startup.
  • Lazy initialization: don't load everything at startup; initialize when needed.
  • Connection reuse: create the DB/HTTP client once, outside the handler and reuse it within the instance (don't open a connection per request).
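# Good example: create the pool/client once at module scope so it is reused within the instance
# Assumes FastAPI, which the decorator below implies
import httpx
from fastapi import FastAPI

app = FastAPI()
client = httpx.AsyncClient(timeout=10.0)   # Outside the handler, so it is reused.

@app.get("/proxy")
async def proxy():
    r = await client.get("https://upstream.example/api")  # Does not reopen the connection
    return r.json()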

Cold-start countermeasures come in two tiers: "warm" (solve with money) and "start fast" (solve with design). First speed up startup through design, then warm only the user-perceived paths that still fall short. That order is the cost-efficient one.


The break-even way of thinking: a rough estimate

To get a feel for "request-based or instance-based, which is cheaper," consider a 1 vCPU / 512 MiB service (unit prices are the rules of thumb above).

  • If 1 instance is up for the whole month (about 2,592,000 seconds):

    • CPU: 2,592,000 sec × $0.000024 ≈ $62
    • Memory: 2,592,000 sec × 0.5 GiB × $0.0000025 ≈ $3.2
    • Total ≈ $65/month (beyond the free tier; a rough upper bound for instance-based billing, which has no per-request charge)
  • By contrast, for a sporadic service that receives only about 1 hour of requests per day in total, request-based billing charges only for the time actually spent processing, so it is an order of magnitude cheaper (in theory 1/24 or less of the always-on compute cost).
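
The arithmetic above can be reproduced in a few lines. This sketch uses the article's rule-of-thumb unit prices for both scenarios and ignores the free tier and request fees; the function name `compute_cost` is made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000, the figure used above
CPU_PRICE = 0.000024      # $ per vCPU-second (rule of thumb from this article)
MEMORY_PRICE = 0.0000025  # $ per GiB-second (rule of thumb from this article)


def compute_cost(billed_seconds: float, vcpu: float = 1.0, memory_gib: float = 0.5) -> float:
    """CPU plus memory cost for the billed time, ignoring free tier and request fees."""
    return billed_seconds * (vcpu * CPU_PRICE + memory_gib * MEMORY_PRICE)


always_on = compute_cost(SECONDS_PER_MONTH)  # one instance up all month
one_hour_a_day = compute_cost(30 * 3600)     # one hour of processing per day

print(f"always on:  ${always_on:.2f}")
print(f"1 hour/day: ${one_hour_a_day:.2f}")
print(f"ratio:      1/{always_on / one_hour_a_day:.0f}")
```

This gives about $65.45 for the always-on case and about $2.73 for one hour a day, a ratio of 1/24.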

The upshot:

  • Low utilization (long idle periods) → request-based × scale-to-zero wins by a wide margin.
  • Almost always processing (high utilization) → instance-based can come out ahead. For sustained steady load, committed use discounts are also worth considering.
  • The deciding axis is utilization. This has the same structure as the AWS rule that Fargate/EC2 beats Lambda for always-on workloads, and it underlies break-even thinking for API vs. self-hosting as well.

The actual amounts vary greatly by configuration, region, and traffic distribution. Always verify with the official pricing and Cloud Billing's real data. What to take home here is the judgment type of "choose the billing mode on the single axis of utilization."


Cost-optimization checklist (official best practices)

The official cost-optimization guide, organized in an order that is practical to apply.

  • Cap with max instances (a safety valve to stop unlimited growth from runaways, attacks, loops)
  • Right-size CPU/memory (lower if utilization is low even at peak. Judge with metrics)
  • Raise concurrency (if I/O-bound, gain density to reduce the instance count. Avoid 1)
  • Choose the billing mode by utilization (sporadic → request-based / steady → instance-based)
  • Min instances only on necessary paths (don't warm everything. Narrow to perceived paths)
  • Require authentication (--no-allow-unauthenticated. Wasted/illicit requests are cost as-is)
  • Migrate to Direct VPC egress (erase the Serverless VPC connector's resident VM / idle cost)
  • Place in the same region (avoid inter-region data-transfer cost)
  • Static content on Cloud CDN (don't let requests reach Cloud Run)
  • Make the image slim (cold-start shortening = improves both startup billing and latency)
  • Enable Budget alerts / Recommender (early-detect billing anomalies, receive optimization suggestions)
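
A few of these checks are mechanical enough to automate. The sketch below is a hypothetical lint over a hand-written config object, not a real Cloud Run API: `ServiceConfig`, its fields, and `review` are all made up for illustration, and it covers only four of the items above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceConfig:
    name: str
    concurrency: int = 80
    min_instances: int = 0
    max_instances: Optional[int] = None  # None means "not set explicitly"
    allow_unauthenticated: bool = False
    user_facing: bool = False            # True if users perceive this path's latency


def review(config: ServiceConfig) -> List[str]:
    """Return checklist findings for one service; an empty list means nothing flagged."""
    findings = []
    if config.max_instances is None:
        findings.append("set max instances as a cost safety valve")
    if config.concurrency == 1:
        findings.append("concurrency 1 multiplies instances; confirm it is required")
    if config.min_instances > 0 and not config.user_facing:
        findings.append("min instances on a path users do not perceive; consider 0")
    if config.allow_unauthenticated:
        findings.append("unauthenticated access allowed; unwanted requests are billed")
    return findings


for finding in review(ServiceConfig("batch", concurrency=1, min_instances=1)):
    print("-", finding)
```

Findings are prompts for review, not errors: a public site legitimately allows unauthenticated access, for example.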

Monitoring: observe billing as "a result of design"

Cost optimization isn't a one-time thing but an activity of observing and continuously tuning.

  • Cloud Monitoring: instance count, CPU/memory utilization, request count, latency. Consistently low utilization is a sign of right-sizing.
  • Budget alerts: notify on threshold overrun. Catch billing anomalies (infinite loops, attacks) early.
  • Recommender: receive optimization suggestions based on real traffic.
  • Cloud Run's Billing panel / Cloud Billing reports: break down which service and which revision is generating cost.

Cost is the result of design and observation, not endurance. I arrived at the asymmetric-region configuration (warm production, scale-to-zero secondary) precisely because I used metrics to visualize where the money was going. Observe first, then design, then verify. Keep to this order and Cloud Run can deliver both speed and low cost.


Conclusion: "fast and cheap" with three dials

Cloud Run cost is determined by three dials.

  1. Concurrency — raise it and the instance count drops and gets cheaper (if I/O-bound; avoid 1).
  2. Autoscaling — scale-to-zero is the biggest weapon. Reconcile "warming insurance" and "the cost safety valve" with min/max instances.
  3. Billing model — choose by utilization. Request-based if sporadic, instance-based if steady.

For cold starts, start fast through design first and pay to warm instances second. Finally, observe and keep tuning. That alone makes Cloud Run a platform on which even a startup or solo developer can maintain production quality without fearing the bill.

For the overall production design, the Cloud Run production-operations guide; for which compute to choose, the GCP container technology-selection guide.

If GCP cost design or platform building is giving you trouble, I can work alongside you as a one-person team augmented by generative AI: fast, cheap, and safe. Drawing on the real-world cost experience of operating a broadcaster platform on Cloud Run, let's work out the configuration and billing design that fits your load characteristics together.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
