Cloud Run concurrency, autoscaling, billing model, and cost optimization: conquering scale-to-zero and cold starts in real code

One day a Cloud Run bill suddenly jumps. "It was supposed to stay within the free tier," "instances multiplied without limit when a spike hit," "I piled on min instances and ended up with constant billing": these all come from not understanding the three dials of concurrency, autoscaling, and the billing model. Conversely, once you grasp these three, Cloud Run becomes an extremely cost-efficient platform that is both fast and cheap.

While operating a broadcaster platform on GCP/Cloud Run, I used an asymmetric configuration: the production region was kept warm with min instances, and the secondary region was scaled to zero for DR. This balanced steady-state cost against disaster-time resilience. Cost is determined by design, not by endurance.

This article stays faithful to the official Cloud Run documentation, breaks down the mechanisms that determine cost, and shows which dial to turn and how much it changes, with real code and estimates. For the big picture of production operation, see the Cloud Run production-operations guide.

⚠️ About pricing: the unit prices here are a rule of thumb as of June 2026, Tier 1 region, request-based billing. Prices get revised. Always confirm the latest actual amounts in the official Cloud Run pricing. The value of this article is in understanding the structure of "what moves the cost."

The cost formula: what are you paying for

Cloud Run (Services) billing, boiled down, is the sum of three components.

CPU time (vCPU-seconds)
Memory time (GiB-seconds)
Request count (per million)

Rule-of-thumb unit prices for Tier 1 region, request-based billing:

Item	Rule-of-thumb unit price (over free tier)
CPU	About $0.000024 / vCPU-second
Memory	About $0.0000025 / GiB-second
Requests	About $0.40 / million requests

Free tier (rule of thumb per month): about 240,000 vCPU-seconds, 450,000 GiB-seconds, 2 million requests. Furthermore, usage is rounded up to the nearest 100 milliseconds.

The decisive point: what is billed is "the time an instance is billed for × the CPU/memory you reserved." In other words, cost is determined by how many instances were up, for how long, and with how much resource reserved. So the two big levers of cost optimization are reducing the instance count (by raising concurrency) and reducing the time instances are up (through scale-to-zero and appropriate idle management).
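The three components and the free tier can be put together in a short estimator. This is a minimal sketch: the constants are the rule-of-thumb figures quoted in this article (not authoritative prices), and the function names `billable_seconds` and `monthly_cost` are made up for illustration.

```python
import math

# Rule-of-thumb figures quoted in this article (Tier 1, request-based billing).
# They are assumptions for illustration; confirm against the official pricing.
CPU_PRICE_PER_VCPU_SECOND = 0.000024
MEMORY_PRICE_PER_GIB_SECOND = 0.0000025
REQUEST_PRICE_PER_MILLION = 0.40
FREE_VCPU_SECONDS = 240_000
FREE_GIB_SECONDS = 450_000
FREE_REQUESTS = 2_000_000


def billable_seconds(duration_ms: float) -> float:
    """Round a duration up to the nearest 100 ms and return it in seconds."""
    return math.ceil(duration_ms / 100) * 100 / 1000


def monthly_cost(vcpu_seconds: float, gib_seconds: float, requests: int) -> float:
    """Estimated monthly charge in USD after subtracting the free tier."""
    cpu = max(0.0, vcpu_seconds - FREE_VCPU_SECONDS) * CPU_PRICE_PER_VCPU_SECOND
    memory = max(0.0, gib_seconds - FREE_GIB_SECONDS) * MEMORY_PRICE_PER_GIB_SECOND
    reqs = max(0, requests - FREE_REQUESTS) / 1_000_000 * REQUEST_PRICE_PER_MILLION
    return cpu + memory + reqs
```

For example, `monthly_cost(1_240_000, 450_000, 3_000_000)` charges only for the 1,000,000 vCPU-seconds and 1,000,000 requests above the free tier. Usage that stays inside the free tier returns 0.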

Dial 1: concurrency — the center of cost

Concurrency is "how many requests one instance handles simultaneously."

The maximum concurrency ... is 80. The maximum value is 1000. ... a concurrency of 1 is likely to negatively affect scaling performance, because many instances will have to start up to handle a spike. (— About concurrency)

Let's see in numbers why concurrency determines cost. Take the case of handling 400 requests simultaneously:

Concurrency	Required instance count (approx.)	Relative cost
1	400	High (instance count = request count)
10	40	Medium
80 (default)	5	Low

The higher the concurrency, the fewer instances you need for the same load, which means less instance time and a lower bill. The official cost-optimization guide states this plainly too.

A higher concurrency setting lets fewer instances handle the same request volume, which can reduce costs. (— Best practices for cost-optimized Cloud Run services)
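The instance counts in the table above are a ceiling division. A minimal sketch (the function name `instances_needed` is made up for illustration, and it only computes raw capacity, not what the autoscaler will actually run):

```python
import math


def instances_needed(concurrent_requests: int, concurrency: int) -> int:
    """Minimum instances whose combined concurrency covers the in-flight requests."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return math.ceil(concurrent_requests / concurrency)


for concurrency in (1, 10, 80):
    print(concurrency, instances_needed(400, concurrency))
```

This prints 400, 40, and 5 instances for concurrency 1, 10, and 80 respectively, matching the table.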

But it's not unconditional. Raising concurrency works when the app can actually handle requests concurrently, which is mostly when it waits on I/O (DB queries, external API calls). For processing that exhausts the CPU in a single request (image conversion, heavy computation), raising concurrency worsens latency.

Tuning guidelines:

I/O-bound (many DB/API waits) → raise concurrency to gain density (consider going from 80 to 100 or more).
CPU-bound (heavy computation) → lower concurrency, but avoid 1 (both scaling performance and cost get worse).
Concurrency 1 is reasonable only as an exception: when there is a clear reason, such as one request fully occupying the instance's resources, or global state that is not safe under concurrent access.

gcloud run deploy api --concurrency 80 --region asia-northeast1 ...
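Why I/O-bound handlers tolerate high concurrency can be seen in a small experiment. The sketch below uses `asyncio.sleep` as a stand-in for a DB or API wait; the handler and the 50 ms wait are hypothetical, not measurements from a real service.

```python
import asyncio
import time


async def handle_request(io_wait_seconds: float) -> None:
    """Stand-in for a handler that mostly waits on a DB or external API."""
    await asyncio.sleep(io_wait_seconds)


async def serve_concurrently(n_requests: int, io_wait_seconds: float) -> float:
    """Run n_requests handlers at once in one process; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handle_request(io_wait_seconds) for _ in range(n_requests)))
    return time.perf_counter() - start


# 80 requests that each wait 50 ms overlap their waits instead of queueing.
elapsed = asyncio.run(serve_concurrently(80, 0.05))
print(f"80 overlapping requests took {elapsed:.3f}s (serial would be about 4s)")
```

Because the waits overlap, one process finishes all 80 requests in roughly the time of one. A CPU-bound handler would not overlap this way, which is why the guideline above lowers concurrency for heavy computation.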

Dial 2: autoscaling — the 60% target and scale-to-zero

Cloud Run automatically adjusts the instance count according to demand.

The autoscaler ... targets ... 60% CPU utilization / 60% concurrency utilization by default. ... revisions scale to zero instances when not receiving traffic ... instances are kept idle for up to 15 minutes. (— About instance autoscaling)

Increases/decreases the instance count targeting 60% utilization (both CPU utilization and concurrency utilization).
Shrinks to zero if there's no traffic (scale to zero). This is the biggest cost weapon.
After processing a request, keeps the instance for up to 15 minutes to reduce cold starts.

Always set min instances and max instances

Max instances (default 100) = the cost safety valve. Stops unlimited growth from spikes, loops, or attacks.
Min instances = cold-start insurance. Always warm to erase initial latency. But what you warm is billed constantly.

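# --min-instances 1: keep one instance warm at the entry point (removes cold starts); 0 allows full scale-to-zero.
# --max-instances 10: the cost safety valve; never scale beyond this.
gcloud run deploy api --region asia-northeast1 \
  --min-instances 1 \
  --max-instances 10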

You don't need to "warm everything." In my project, I warmed only the production-region entry point that users touch directly, with min-instances=1, and set the secondary region and internal batch services to min-instances=0. Send the paths where cold starts are tolerable to scale-to-zero, and warm only the paths where they aren't. This selective warming is the sweet spot between cost and user experience.
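The 60% target and the min/max bounds described in this section can be combined into a rough mental model. This is a deliberately simplified sketch, not the actual autoscaler algorithm (which also weighs CPU utilization and other signals); the function name and the idea of driving it from in-flight requests alone are assumptions.

```python
import math


def estimated_instances(
    in_flight_requests: int,
    concurrency: int,
    min_instances: int = 0,
    max_instances: int = 100,
    target_utilization: float = 0.6,
) -> int:
    """Rough instance count: keep concurrency utilization near the target, then clamp."""
    if in_flight_requests <= 0:
        wanted = 0  # no traffic: scale to zero unless min_instances says otherwise
    else:
        wanted = math.ceil(in_flight_requests / (concurrency * target_utilization))
    return max(min_instances, min(wanted, max_instances))
```

Under this model, 400 in-flight requests at concurrency 80 lead to 9 instances rather than the bare-capacity 5, because each instance is kept at about 60% of its concurrency. With `max_instances=10`, a spike of 4,000 in-flight requests is capped at 10, and with `min_instances=1` an idle service stays at 1.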

Dial 3: the billing model — request-based vs. instance-based

This is the most misunderstood dial and the one with the largest cost impact. Cloud Run has two billing modes, determined by how CPU is allocated (Billing settings).

	Request-based (default)	Instance-based
Flag	`--cpu-throttling`	`--no-cpu-throttling`
CPU allocation	Only during request processing	The instance's whole lifecycle
Billing timing	During request processing + startup/shutdown	All the time it's up
Background processing after the response	Can't (CPU is throttled)	Can
Suited traffic	Sporadic, spiky, bursty	Steady, slowly varying

The official guidance is simple.

For services with steady, slowly varying traffic, consider using instance-based billing ... services with sporadic, bursty, or spiky traffic, consider using request-based billing. (— Cost optimization best practices)

Which to choose: three judgments

Traffic is sporadic/spiky → request-based (default). CPU billing stops when idle, so it's overwhelmingly cheap for intermittent load.
Traffic is steady/high-utilization → consider instance-based. If instances are almost always processing, it can be advantageous because CPU is never throttled and restored between requests.
You want background processing after returning the response (async log sending, post-processing, cache update) → instance-based. With request-based, CPU is throttled after the response and post-processing doesn't run.

# Default (request-based billing): suited to sporadic traffic
gcloud run deploy api --cpu-throttling --region asia-northeast1 ...

# Instance-based billing: steady traffic, or background processing needed after the response
gcloud run deploy worker --no-cpu-throttling --region asia-northeast1 ...

In Terraform, express it with resources.cpu_idle (true = request-based, false = instance-based).

resources {
  limits   = { cpu = "1", memory = "512Mi" }
  cpu_idle = true   # true = request-based billing (default, for sporadic traffic) / false = instance-based billing (for steady traffic)
}
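The three judgments in this section can be written down as a small decision function. This is a sketch of the rule of thumb only; the enum and function names are made up for illustration, and the flag strings are the gcloud flags from the table above.

```python
from enum import Enum


class BillingMode(Enum):
    REQUEST_BASED = "--cpu-throttling"
    INSTANCE_BASED = "--no-cpu-throttling"


def choose_billing_mode(needs_background_work: bool, traffic_is_steady: bool) -> BillingMode:
    """Apply the article's three judgments in order of how hard the constraint is."""
    if needs_background_work:
        # Work after the response needs CPU outside request handling.
        return BillingMode.INSTANCE_BASED
    if traffic_is_steady:
        # High, steady utilization: instance-based is worth considering.
        return BillingMode.INSTANCE_BASED
    # Sporadic or spiky traffic: the default is the cheap choice.
    return BillingMode.REQUEST_BASED
```

Background work is checked first because it is a functional requirement rather than a cost preference. For steady traffic the result is a candidate to verify against real billing data, not a guarantee that it is cheaper.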

Shrink cold starts

The price of scale-to-zero is the cold start. When waking from zero, initial latency grows by the container-startup time plus the app-initialization time. The countermeasures come in stages.

1. Warm with min instances (the most reliable)

Set --min-instances to 1 or more and that many instances are always up, so requests they serve see no cold start. This is reliable but billed constantly. The trick is to use it only on the paths users actually perceive.

2. Startup CPU boost

The --cpu-boost flag increases CPU only during startup to speed up initialization. The additional cost is small and the effect is large, so treat it as a standard setting.

gcloud run deploy api --cpu-boost --region asia-northeast1 ...

3. Choose the execution environment (gen1 cold-starts faster)

gen1 (gVisor) cold-starts faster and suits lightweight APIs. If you need full Linux compatibility or NFS/VPC performance, gen2. If cold start is the top priority, specify gen1.

4. Make the image smaller, delay initialization

A slim base image (distroless, alpine, multi-stage build) for fast startup.
Lazy initialization: don't load everything at startup; initialize when needed.
Connection reuse: create the DB/HTTP client once, outside the handler and reuse it within the instance (don't open a connection per request).

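# Good: create the pool/client once at module scope, then reuse it within the instance.
# Assumes a FastAPI app; the original snippet did not show how `app` is created.
import httpx
from fastapi import FastAPI

app = FastAPI()
client = httpx.AsyncClient(timeout=10.0)   # Outside the handler, so it is reused.

@app.get("/proxy")
async def proxy():
    r = await client.get("https://upstream.example/api")  # Reuses pooled connections.
    return r.json()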

Cold-start countermeasures come in two tiers: "warm (solve with money)" and "start fast (solve with design)." First speed up startup through design, then warm only the user-perceived paths that still fall short. This order is the cost-efficient one.

The break-even way of thinking: a rough estimate

To get a feel for "request-based or instance-based, which is cheaper," consider a 1 vCPU / 512 MiB service (unit prices are the rules of thumb above).

If 1 instance is up for the whole month (30 days, about 2,592,000 seconds):
- CPU: 2,592,000 sec × 1 vCPU × $0.000024 ≈ $62
- Memory: 2,592,000 sec × 0.5 GiB × $0.0000025 ≈ $3.2
- Total ≈ $65/month (before the free tier; a rough upper bound for instance-based billing, which has no per-request charge)

On the other hand, for a sporadic service that receives only about 1 hour of requests per day in total, request-based billing charges only for the time actually spent processing, so it is an order of magnitude cheaper (in theory about 1/24 of the always-on figure, or less).
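The estimate above can be reproduced in a few lines. This sketch uses the article's rule-of-thumb unit prices for both cases and ignores the free tier, the per-request fee, and idle billing, so the numbers are only for comparing orders of magnitude; the function names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

# Rule-of-thumb unit prices from this article; assumptions, not official prices.
CPU_PRICE = 0.000024      # USD per vCPU-second
MEMORY_PRICE = 0.0000025  # USD per GiB-second


def always_on_cost(vcpu: float, memory_gib: float) -> float:
    """Monthly cost of one instance that is billed for every second of the month."""
    return SECONDS_PER_MONTH * (vcpu * CPU_PRICE + memory_gib * MEMORY_PRICE)


def busy_time_cost(vcpu: float, memory_gib: float, busy_hours_per_day: float) -> float:
    """Monthly cost when only the hours spent processing requests are billed."""
    utilization = busy_hours_per_day / 24
    return always_on_cost(vcpu, memory_gib) * utilization


print(round(always_on_cost(1, 0.5), 2))      # 1 vCPU / 512 MiB, up all month
print(round(busy_time_cost(1, 0.5, 1), 2))   # same size, 1 busy hour per day
```

For 1 vCPU / 512 MiB this gives about $65 for the always-on case and under $3 for one busy hour per day. The gap narrows as `busy_hours_per_day` approaches 24.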

The feel of the conclusion:

Low utilization (long idle) → request-based × scale-to-zero wins overwhelmingly.
Almost always processing (high utilization) → instance-based can be advantageous. For further steady load, committed use discounts are also a consideration.
The deciding axis is utilization. This is the same structure as the AWS rule of thumb that Fargate/EC2 is cheaper than Lambda for always-on workloads, and it also underlies break-even thinking for API versus self-hosting.

The actual amounts vary greatly by configuration, region, and traffic distribution. Always verify with the official pricing and real data from Cloud Billing. What to take home here is the decision pattern: choose the billing mode on the single axis of utilization.

Cost-optimization checklist (official best practices)

This section organizes the official cost-optimization guide in a practically usable order.

Monitoring: observe billing as "a result of design"

Cost optimization isn't a one-time thing but an activity of observing and continuously tuning.

Cloud Monitoring: instance count, CPU/memory utilization, request count, latency. Consistently low utilization is a sign that the service should be right-sized.
Budget alerts: notify on threshold overrun. Catch billing anomalies (infinite loops, attacks) early.
Recommender: receive optimization suggestions based on real traffic.
Cloud Run's Billing panel / Cost Explorer: break down which service / which revision is generating cost.
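The "consistently low utilization" signal from the list above can be checked mechanically once the metrics are exported. A minimal sketch, assuming utilization samples are already available as fractions between 0 and 1; the function name and the 30% threshold are hypothetical choices, not official guidance.

```python
from typing import Sequence


def is_rightsizing_candidate(utilization_samples: Sequence[float], threshold: float = 0.3) -> bool:
    """True when every sample stays below the threshold (consistently low utilization)."""
    if not utilization_samples:
        return False  # no data: do not draw a conclusion
    return max(utilization_samples) < threshold


cpu_samples = [0.08, 0.12, 0.10, 0.15, 0.09]  # made-up example data
print(is_rightsizing_candidate(cpu_samples))
```

Using the peak instead of the average keeps a service with occasional heavy bursts from being flagged. A flagged service is a prompt to look closer, not an automatic reason to cut CPU or memory.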

Cost is a result of design and observation, not endurance. I reached the asymmetric-region configuration (warm production, scale-to-zero secondary) precisely because I used metrics to visualize where the money was going and how much. First observe, then design, then verify. Keep to this order and Cloud Run can deliver both speed and low cost.

Conclusion: "fast and cheap" with three dials

Cloud Run cost is determined by three dials.

Concurrency: raise it and the instance count drops, which lowers cost (if I/O-bound; avoid 1).
Autoscaling: scale-to-zero is the biggest weapon. Balance "warming insurance" against "the cost safety valve" with min/max instances.
Billing model: choose by utilization. Request-based if sporadic, instance-based if steady.

And for cold starts, "start fast with design" first, "solve with money by warming" later. Finally, observe and keep tuning. With just this, Cloud Run becomes a platform where even a startup or solo developer keeps production quality without fearing the bill.

For the overall production design, see the Cloud Run production-operations guide; for which compute to choose, see the GCP container technology-selection guide.

If you're struggling with GCP cost design or platform building, I can work alongside you as a one-person team augmented with generative AI: fast, cheap, and safe. Drawing on the real cost experience of operating a broadcaster platform on Cloud Run, let's nail down together the configuration and billing design that fits your load characteristics.

Related articles

Google Cloud Run Production-Operations Guide: Container Contract, Concurrency, Auto-Scale, Deploy, Cost, and Security in Real Code

Cloud Run CI/CD: keyless, Blue/Green, and canary in real code with Cloud Build / GitHub Actions × Workload Identity

Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable

Cloud Run networking and security: defense in depth with Ingress control, IAM auth, Direct VPC egress, and Cloud Armor

Also worth reading

Vercel production-operation guide: use it not as a front-end-only host but as a 'full-compute platform'

The complete Azure Container Apps autoscaling guide: scale-to-zero and event-driven with KEDA (HTTP, queue, CPU)

Azure Container Apps Production Operations Guide: Designing, Scaling, Deploying, Costing, and Securing Serverless Containers, with Real Code