# Google Cloud Run Production-Operations Guide: Container Contract, Concurrency, Auto-Scale, Deploy, Cost, and Security in Real Code

> A production-operations guide to Cloud Run that stays faithful to the official Google Cloud documentation. It covers the container contract (PORT/SIGTERM), concurrency (default 80, max 1000), scale-to-zero, request-based and instance-based billing, traffic splitting across revisions (Blue/Green, canary), health checks, least-privilege service accounts and Secret Manager, and Direct VPC egress, all with real gcloud, Terraform, and FastAPI/Node code.

- Published: 2026-06-28
- Author: 友田 陽大
- Tags: GCP, Cloud Run, Serverless, Containers, Infrastructure, Cost Optimization, Observability, Terraform
- URL: https://tomodahinata.com/en/blog/google-cloud-run-production-guide
- Category: Google Cloud Run in production

## Key points

- Cloud Run is a fully managed serverless platform that "runs code, functions, or containers on Google's infrastructure." The deploy unit is always a container image, which you can build yourself or have buildpacks build from source. It comes in 3 forms: Services (HTTP), Jobs (run-to-completion), and Worker Pools (resident background processing).
- Production quality rests on 4 points: the container contract (listen on `$PORT` at `0.0.0.0`, clean up within 10 seconds of SIGTERM); concurrency (default 80, max 1000), which drives both cost and scaling; appropriate use of scale-to-zero and minimum instances; and Blue/Green and canary releases through traffic splitting across immutable revisions.
- Billing has 2 modes: request-based billing (`--cpu-throttling`, the default) and instance-based billing (`--no-cpu-throttling`). The former bills CPU only while requests are being processed and suits sporadic traffic; the latter bills for the whole instance lifecycle and allows background processing after the response.
- Security starts with three moves: assign a least-privilege, user-managed service account per service instead of the default Compute Engine service account; keep secrets in Secret Manager (environment variable = fixed at startup, volume = always latest); and require authentication (`--no-allow-unauthenticated`).
- The request timeout defaults to 300 seconds, with a maximum of 60 minutes. Move anything longer out of HTTP and into Cloud Run Jobs / Cloud Workflows, and design long-running jobs to be idempotent and resumable.

---

"I want to run containers in production. But I can't spare time for Kubernetes-cluster node management or patching" — when assembling a container foundation on GCP for a startup or small-team development, you almost always arrive here. The answer is **Google Cloud Run.**

I have **built an in-house AI platform for a major domestic broadcaster on GCP, with Terraform as IaC**, and run it in production ([case study](/case-studies/broadcaster-ai-content-platform)). The workloads include a group of FastAPI APIs, broadcast-quality speech synthesis, an OCR × speech-recognition pipeline for telop (on-screen caption) typo detection, and a ClamAV malware scanner for uploaded material. All of them run on **Cloud Run services and jobs**. Data lives in Cloud SQL / Memorystore / Firestore / Cloud Storage, Cloud Armor sits at the entrance, and CI/CD is **keyless with Workload Identity Federation**. The production region keeps 1 instance warm while the secondary region scales to zero for DR, all without dedicated VMs or Kubernetes.

This article aims to be **faithful to the [Cloud Run official documentation](https://docs.cloud.google.com/run/docs/overview/what-is-cloud-run) while being easier to follow**, and to show with real code which feature to use in which situation. It covers what you need to ship to production end to end: the container contract, resource design, concurrency, scaling, deployment, resilience, security, and cost.

Technology selection itself (Cloud Run, GKE, or App Engine) is covered in the [GCP container technology-selection guide](/blog/google-cloud-run-vs-gke-app-engine-cloud-run-functions-compute-selection-guide), and the deep dive on concurrency, billing, and cost optimization lives in the [Cloud Run auto-scale, billing, and cost-optimization guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide). This article concentrates on **how to build for production once you have chosen Cloud Run.**

---

## What Cloud Run Is: The Official Definition

The official definition is simple.

> Cloud Run is a fully managed application platform for running your code, function, or container on top of Google's highly scalable infrastructure. ([What is Cloud Run](https://docs.cloud.google.com/run/docs/overview/what-is-cloud-run))

In other words, Cloud Run is a **serverless platform that lets you concentrate on running containers and leaves server configuration, OS patching, orchestration, and scaling to the platform.** The key features from the official docs:

- **The deploy unit is always a container image.** You can build it yourself, or hand over source code (Go / Node.js / Python / Java / .NET / Ruby, etc.) and **buildpacks** auto-containerize it.
- **Any language / binary runs.** As the official docs say, "You can deploy code written in any programming language on Cloud Run if you can build a container image from it."
- **It has 3 resource forms.**

| Form | Role | Representative use |
|------|------|--------------|
| **Services** | Receive requests at a stable HTTPS endpoint, auto-scaling with traffic | REST/GraphQL API, web app, webhook receiver |
| **Jobs** | Run, finish, and stop. Manual/scheduled start, parallel tasks | Batch, DB migration, long-running bulk processing |
| **Worker Pools** | Resident background processing | A Pub/Sub pull subscriber, a Kafka consumer |

This article mainly covers **Services**; the later sections show when to use Jobs and Worker Pools. In a real project, I put HTTP APIs on Services and heavy, long-running processing such as telop-typo detection and malware scanning on Jobs.

---

## When to Use It: A First-Pass Decision Table (Details in the Technology-Selection Guide)

GCP offers several ways to run containers. I leave the deep comparison to the [technology-selection guide](/blog/google-cloud-run-vs-gke-app-engine-cloud-run-functions-compute-selection-guide) and show only the first decision axis here.

| Service | In one phrase | When to choose |
|---------|------------|------------|
| **Cloud Run** | A serverless container / microservice platform | You want to run stateless containers **without operating K8s.** **When in doubt, start here.** |
| **Cloud Run functions** (formerly Cloud Functions) | An event-driven FaaS | You need a **function** that responds to a single trigger (HTTP/Pub/Sub/Storage, etc.). Runs on the Cloud Run platform. |
| **GKE / GKE Autopilot** | Managed Kubernetes | You need **K8s-specific features** such as DaemonSet, CRD, Operator, or a service mesh. |
| **App Engine** | A legacy PaaS | You have existing App Engine assets. **For new projects, Cloud Run is recommended** (see below). |
| **Compute Engine** | A VM | The workload can't be containerized, or it needs OS-level control or a resident GPU. |

For new development, the official docs (the App Engine migration guide) are explicit.

> For new Google Cloud users, we recommend using Cloud Run as the preferred alternative over App Engine. ([Compare App Engine and Cloud Run](https://docs.cloud.google.com/appengine/migration-center/run/compare-gae-with-run))

When in doubt, choose Cloud Run. That is the default answer on GCP in 2026.

---

## The Container Contract (Runtime Contract): The 5 Promises to Uphold

A container deployed to Cloud Run must satisfy the **container runtime contract.** Miss it and you get "it works locally but won't start in production." I've organized the most important requirements into 5 promises.

### 1. Listen on `$PORT` at `0.0.0.0`

> The ingress container within an instance must listen for requests on `0.0.0.0` on the port to which requests are sent. ([Container runtime contract](https://docs.cloud.google.com/run/docs/container-contract))

The port is passed via the environment variable `PORT` (default `8080`). Listen on `localhost`/`127.0.0.1` and it's unreachable from outside, causing a startup failure. **Always listen on `0.0.0.0`, reading `$PORT`.**

```python
# FastAPI (uvicorn). Read PORT and listen on 0.0.0.0.
import os
import uvicorn
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def root():
    return {"ok": True}

if __name__ == "__main__":
    # Defaults to 8080. Cloud Run injects PORT, so always read it from the environment.
    port = int(os.environ.get("PORT", "8080"))
    uvicorn.run(app, host="0.0.0.0", port=port)
```

```javascript
// Node.js (Express). Likewise, read PORT and listen on 0.0.0.0.
import express from "express";

const app = express();
app.get("/", (_req, res) => res.json({ ok: true }));

const port = Number(process.env.PORT ?? 8080);
app.listen(port, "0.0.0.0", () => console.log(`listening on ${port}`));
```

### 2. Be Stateless

Instances are added, removed, and destroyed at any time. **Don't persist state (sessions, counters, files being uploaded) in an instance's memory or on its local disk.** Keep state externally in Cloud SQL / Memorystore / Firestore / Cloud Storage, etc.

### 3. The File System Is In-Memory

> the in-memory filesystem ... writing too much data can crash the instance.

The writable file system is **in-memory**, so everything you write consumes the instance's memory. Write a large temporary file and the instance crashes with an out-of-memory error. Keep temporary data small, or stream it to Cloud Storage (the malware scanner described later scans as a stream without buffering for exactly this reason).
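
A minimal sketch of the "keep temporary data small" idea, assuming the data arrives as a readable binary stream: read it in fixed-size chunks and never hold the whole object in memory or on the in-memory file system. The names `sha256_stream` and `CHUNK_SIZE` are illustrative and not from the original project, and hashing stands in for whatever per-chunk work (scanning, uploading) a real service would do.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an illustrative value, not a Cloud Run limit


def sha256_stream(src: BinaryIO, chunk_size: int = CHUNK_SIZE) -> tuple[str, int]:
    """Hash a stream chunk by chunk so memory use stays near chunk_size."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


if __name__ == "__main__":
    # In a real service the source would be a Cloud Storage read stream.
    payload = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
    print(sha256_stream(payload))
```

Peak memory here depends on `chunk_size`, not on the size of the object, which is the property that matters on an in-memory file system.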

### 4. Receive SIGTERM and Clean Up Within 10 Seconds

> Before shutting down an instance, Cloud Run sends a `SIGTERM` signal to all the containers in an instance, indicating the start of a 10 second period before the actual shutdown occurs, at which point Cloud Run sends a `SIGKILL` signal. ([Container runtime contract](https://docs.cloud.google.com/run/docs/container-contract))

Instances are shut down on every scale-in, deploy, or revision switch. When you receive `SIGTERM`, **complete in-progress requests, close connections, and flush buffers within 10 seconds.** The detailed code is in the [graceful shutdown](#graceful-shutdown-sigterm-and-idempotency) section.

### 5. Return a Response Within the Timeout

If the response doesn't complete within the [request timeout](#request-timeout-long-running-processing-goes-to-jobs) (default 300 seconds), the client gets a **504.** Don't hold long-running processing in synchronous HTTP; decouple it to jobs or workflows.

---

## The First Deploy: From Source or From a Container

The shortest path is a source deploy. You don't even need a `Dockerfile`, because buildpacks take care of it.

```bash
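# Deploy directly from source (buildpacks containerize it automatically -> Artifact Registry -> Cloud Run)
gcloud run deploy api \
  --source . \
  --region asia-northeast1 \
  --no-allow-unauthenticated   # Require authentication first (explained below)

# Deploy from an image you built yourself (recommended for production: more reproducible)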
gcloud run deploy api \
  --image asia-northeast1-docker.pkg.dev/PROJECT_ID/repo/api:GIT_SHA \
  --region asia-northeast1 \
  --no-allow-unauthenticated
```

Make a habit of passing **`--no-allow-unauthenticated`** first. `--allow-unauthenticated` means "publish to the entire internet without auth." Require authentication for in-house tools and inter-service calls, and explicitly open only what truly needs to be public (an unauthenticated endpoint also raises cost, because you pay for unwanted requests).

> In production, the standard is to build the image in CI (Cloud Build / GitHub Actions) and deploy a tagged image to Cloud Run. In my project too, I separated responsibilities to prevent drift: **Terraform owns the infrastructure configuration, and Cloud Build owns the image and the latest environment settings.** For making the CI side keyless, see the [Workload Identity Federation article](/blog/github-actions-oidc-keyless-cicd-aws-gcp-guide).

---

## Resource Design: Understand the "Combinations" of CPU and Memory

CPU and memory can be decided independently, but **the upper/lower bounds of memory are determined per CPU value** ([Configure CPU limits](https://docs.cloud.google.com/run/docs/configuring/services/cpu)).

| vCPU | Memory range |
|------|------------|
| 0.08 | Up to 512 MiB |
| 0.5  | Up to 1 GiB |
| 1    | Up to 4 GiB |
| 2    | Up to 8 GiB |
| 4    | 2–16 GiB |
| 6    | 4–24 GiB |
| 8    | 4–32 GiB |

- **vCPU ranges from 0.08 to 8.** Below 1 you can set a fractional value (e.g. `0.25`; the official docs list increments of 0.01), and at 1 or above only the integers `1, 2, 4, 6, 8` are allowed.
- The principle is to **start small and right-size from metrics.** Oversizing from the start goes straight onto the bill.

```bash
gcloud run deploy api \
  --image IMAGE_URL --region asia-northeast1 \
  --cpu 1 --memory 512Mi \
  --cpu-boost           # Add CPU only during startup to shorten cold starts
```
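
The CPU/memory combinations can be checked before a deploy. The sketch below encodes the table above as data; the function names are hypothetical, rows without a lower bound are treated as 0, and mapping a fractional vCPU value to the nearest table row at or below it is my assumption, so treat the official limits as the source of truth.

```python
GIB = 1024  # MiB per GiB

# vCPU -> (min MiB, max MiB), copied from the table above.
# Rows without a lower bound in the table use 0 here (a simplification).
MEMORY_RANGE_MIB = {
    0.08: (0, 512),
    0.5: (0, 1 * GIB),
    1: (0, 4 * GIB),
    2: (0, 8 * GIB),
    4: (2 * GIB, 16 * GIB),
    6: (4 * GIB, 24 * GIB),
    8: (4 * GIB, 32 * GIB),
}


def memory_range_mib(cpu: float) -> tuple[int, int]:
    """Return the (min, max) memory in MiB that the table allows for this vCPU value."""
    if not 0.08 <= cpu <= 8:
        raise ValueError(f"vCPU must be between 0.08 and 8, got {cpu}")
    if cpu >= 1 and cpu not in (1, 2, 4, 6, 8):
        raise ValueError(f"vCPU of 1 or more must be one of 1, 2, 4, 6, 8, got {cpu}")
    # Assumption: fractional values use the nearest table row at or below them.
    row = max(c for c in MEMORY_RANGE_MIB if c <= cpu)
    return MEMORY_RANGE_MIB[row]


def memory_fits(cpu: float, memory_mib: int) -> bool:
    """True if memory_mib is inside the allowed range for this vCPU value."""
    low, high = memory_range_mib(cpu)
    return low <= memory_mib <= high


if __name__ == "__main__":
    print(memory_fits(1, 512))    # the --cpu 1 --memory 512Mi example above
    print(memory_fits(4, 1024))   # below the 2 GiB lower bound for 4 vCPU
```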

### Startup CPU Boost

Pass `--cpu-boost` and Cloud Run **temporarily increases the CPU only during instance startup** (e.g. the equivalent of 2 vCPUs during startup for a 1 vCPU service). It shortens cold starts for apps with heavy JVM, Node, or Python initialization, and the effect is large relative to the additional cost, so it is a sensible default.

### Execution Environments: gen1 and gen2

Cloud Run has 2 generations of execution environments ([About execution environments](https://docs.cloud.google.com/run/docs/about-execution-environments)).

| | **gen1** | **gen2** |
|---|---------|---------|
| Foundation | gVisor | microVM |
| Cold start | **Fast** | Somewhat slower for some services |
| Linux compatibility | Emulates many syscalls (some unsupported) | **Full Linux compatibility** |
| Network file system | No | **Yes (NFS, etc.)** |
| CPU/network performance | Standard | **Fast** |
| Memory lower bound | Below 512 MiB possible | 512 MiB or above |

- The default is **unspecified (the platform selects automatically).**
- Choose **gen1 for lightweight APIs where cold start comes first**, and **gen2 for full Linux compatibility, NFS, VPC egress, or CPU-intensive workloads.**
- **Jobs and Worker Pools are always gen2.**

```bash
gcloud run deploy api --execution-environment gen2 --region asia-northeast1 ...
```

---

## Concurrency: The Number of Requests One Instance Handles Simultaneously

Cloud Run's most-important parameter is **concurrency.** It decides "up to how many requests one instance handles simultaneously."

> The maximum concurrency ... is `80` (Console) / 80 times the number of vCPUs (CLI/Terraform). The maximum value is `1000`. ([About concurrency](https://docs.cloud.google.com/run/docs/about-concurrency))

- **The default is 80** (with gcloud/Terraform, the default is 80 × the vCPU count). **The max is 1000.**
- The lower the concurrency, **the more instances you need for the same load**, which means more cold starts and usually higher cost.
- The official docs state plainly that "**concurrency `1` significantly degrades scale performance** (many instances will have to start up to handle a spike)." The service becomes fragile under spikes.
- Tuning rule of thumb: if each request uses a lot of CPU or memory, lower the concurrency; if requests mostly wait on I/O (DB, external API), raise it to gain density.

```bash
gcloud run deploy api --concurrency 80 --region asia-northeast1 ...
```
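
To make the density argument concrete, here is a back-of-envelope estimate based on Little's law (in-flight requests ≈ request rate × average latency). It is a simplification under stated assumptions: the real autoscaler also looks at CPU utilization, and the function name and the example numbers are hypothetical.

```python
import math


def estimate_instances(
    rps: float,
    latency_s: float,
    concurrency: int,
    target_utilization: float = 0.6,  # the default target cited in the autoscaling docs
) -> int:
    """Rough instance count needed to hold the in-flight requests at the target utilization."""
    in_flight = rps * latency_s  # Little's law: average requests in progress
    if in_flight == 0:
        return 0  # no traffic: the service can scale to zero
    per_instance = concurrency * target_utilization
    return math.ceil(in_flight / per_instance)


if __name__ == "__main__":
    for c in (80, 10, 1):
        print(c, estimate_instances(rps=200, latency_s=0.25, concurrency=c))
```

With 200 requests per second at 250 ms each, this estimate gives 2 instances at concurrency 80 and 84 at concurrency 1, which illustrates why a low concurrency setting is fragile under spikes.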

How concurrency **drives scaling and billing**, down to the unit-cost calculation, is explained in the dedicated article [concurrency, auto-scale, billing](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide). For now, just remember that concurrency is the central dial for performance and cost.

---

## Auto-Scale: Scale-to-Zero and Minimum/Maximum Instances

Cloud Run **shrinks to zero (scale to zero) when no requests arrive** and scales out automatically when they do. The autoscaler works as follows:

> The autoscaler ... targets ... 60% CPU utilization / 60% concurrency utilization by default. ([About instance autoscaling](https://docs.cloud.google.com/run/docs/about-instance-autoscaling))

- It adjusts the instance count to **target 60% utilization by default** (both CPU utilization and concurrency utilization).
- After requests finish, **it keeps idle instances for up to 15 minutes (10 minutes for GPU)** to reduce cold starts.
- **Minimum instances (min instances)** keep instances warm to eliminate cold starts. **Maximum instances (max instances, default 100)** cap the cost if traffic runs away.

```bash
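# --min-instances 1: keep one instance warm at the production entrance to avoid cold starts
# --max-instances 10: a cost safety valve; even a spike tops out at 10 instances
# (Comments sit above the command because a comment after a trailing backslash breaks the line continuation.)
gcloud run deploy api \
  --min-instances 1 \
  --max-instances 10 \
  --region asia-northeast1 ...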
```

> In a real project, I used an asymmetric configuration: **the production region stays warm with min-instances=1, and the secondary DR region runs with min-instances=0 (scale to zero).** This keeps normal-time cost down while preserving resilience on failure. You don't need to keep every region warm.

The scale design (cold-start countermeasures, how to choose min/max, the meaning of the 60% target) is covered in depth in the [auto-scale article](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide).

---

## Request Timeout: Long-Running Processing Goes to Jobs

> Default timeout: 5 minutes (300 seconds). Maximum timeout: 60 minutes (3,600 seconds). ([Request timeout](https://docs.cloud.google.com/run/docs/configuring/request-timeout))

- **Default 300 seconds, max 60 minutes.** You can also specify a duration such as `--timeout 1m20s`.
- **Don't hold anything longer (video processing, large batches, long LLM inference) in synchronous HTTP.** The processing keeps running even if the client disconnects, and a client retry then executes it more than once.

The right answer is to **separate reception (Service) and execution (Job/Workflow).**

```bash
# Move long-running processing into Cloud Run Jobs (decoupled from HTTP).
# --tasks 10 --parallelism 5: run 10 tasks, at most 5 in parallel.
gcloud run jobs create telop-ocr \
  --image IMAGE_URL --region asia-northeast1 \
  --tasks 10 --parallelism 5 \
  --max-retries 3 --task-timeout 3600s
gcloud run jobs execute telop-ocr --region asia-northeast1
```

> My telop-typo-detection pipeline had exactly this shape. The HTTP API only starts a job and immediately returns a reception ID, and the heavy OCR and speech-recognition processing runs in **parallel on Cloud Run Jobs + Cloud Workflows.** Progress reaches the UI in near real time through a Firestore snapshot subscription + SSE, and the run time went from **18 minutes sequential to 13 minutes parallel (a little under 30% shorter).** Always design long-running processing to be idempotent and resumable (number segment IDs deterministically so that a re-run converges on the same result).
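
One way to get tasks that are idempotent and resumable is to derive both the work split and the output keys from the input alone. Cloud Run Jobs sets `CLOUD_RUN_TASK_INDEX` and `CLOUD_RUN_TASK_COUNT` in every task; the sketch below uses them to shard segments. The function names, the round-robin split, and the example URI are assumptions for illustration, not the pipeline's actual code.

```python
import hashlib
import os


def segment_id(source_uri: str, index: int) -> str:
    """Derive a stable ID from the input, so a re-run writes to the same key."""
    raw = f"{source_uri}#{index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def my_segments(total_segments: int, task_index: int, task_count: int) -> list[int]:
    """Return the segment indexes this task owns (round-robin by task index)."""
    return [i for i in range(total_segments) if i % task_count == task_index]


if __name__ == "__main__":
    # Cloud Run Jobs injects these variables; the defaults allow a local run.
    task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
    task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))
    for i in my_segments(20, task_index, task_count):
        # A retried task recomputes the same IDs and overwrites the same results.
        print(segment_id("gs://example-bucket/video.mp4", i), i)
```

Because the ID depends only on the source and the index, a retried task lands on the same keys instead of creating duplicates.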

---

## Health Checks: Startup and Liveness

Cloud Run has 2 kinds of probes ([Configure health checks](https://docs.cloud.google.com/run/docs/configuring/healthchecks)).

- **Startup probe**: decides whether startup has completed. No traffic is routed until it succeeds. For a new service the default is a **TCP probe on the container port** (`timeoutSeconds: 240` / `periodSeconds: 240` / `failureThreshold: 1`).
- **Liveness probe**: monitors the container continuously after startup. **A failure restarts the container** (if the probe doesn't succeed within `failureThreshold × periodSeconds`, the container gets SIGKILL and a new instance starts).

For an HTTP probe, **2XX/3XX means success** and anything else means failure. Implement `/healthz` in the app and, as a baseline, **return only a lightweight "am I alive" answer** (if you run a heavy dependency check on every probe, the probe backs up and can trigger a chain of restarts).

```python
from fastapi import FastAPI, Response
app = FastAPI()

@app.get("/healthz")
def liveness():
    # Liveness only answers "can this process respond?" and stays lightweight.
    # No dependency checks here, so a DB/Redis problem can't push the service into a restart loop.
    return {"status": "ok"}
```

A Terraform configuration example is in the [IaC section later](#iac-build-it-declaratively-with-terraform). For a slow-starting app, widen the startup probe's `failure_threshold × period_seconds` so it gives **enough grace for startup** (once you define a custom probe, its values replace the default single 240-second attempt, so tight values can kill an app that is merely slow).
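
The grace window is simple arithmetic, so it can be derived from a measured startup time rather than guessed. A small sketch, where the function name and the 1.5× safety factor are my assumptions:

```python
import math


def startup_failure_threshold(
    expected_startup_s: float,
    period_s: int,
    safety_factor: float = 1.5,  # assumption: pad the measured startup time by 50%
) -> int:
    """Smallest failure_threshold whose grace window covers the padded startup time."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return max(1, math.ceil(expected_startup_s * safety_factor / period_s))


if __name__ == "__main__":
    threshold = startup_failure_threshold(expected_startup_s=30, period_s=5)
    print(threshold, "probes ->", threshold * 5, "seconds of grace")
```

For an app that takes about 30 seconds to start and a 5-second period, this gives a threshold of 9, which is a 45-second window.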

---

## Graceful Shutdown: SIGTERM and Idempotency

Per the container contract, it's **SIGKILL 10 seconds after SIGTERM.** In these 10 seconds, finish "completing in-progress requests," "closing connections," and "flushing buffers."

```python
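# FastAPI (uvicorn). uvicorn handles SIGTERM itself and then runs the lifespan shutdown.
import logging
import os
from contextlib import asynccontextmanager

import asyncpg  # assumption: any async pool with create/close works the same way
from fastapi import FastAPI

log = logging.getLogger("app")

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: acquire pools and clients. DATABASE_URL is a hypothetical variable name.
    app.state.pool = await asyncpg.create_pool(dsn=os.environ["DATABASE_URL"])
    yield
    # Shutdown (the lifespan shutdown runs after SIGTERM): always close resources.
    log.info("draining: closing pool within the 10s grace window")
    await app.state.pool.close()

app = FastAPI(lifespan=lifespan)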
```

```javascript
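// Node.js. On SIGTERM, stop accepting new connections, wait for in-flight requests, then exit.
// `app` is the Express app from earlier; `pool` is assumed to be your DB pool (e.g. pg.Pool).
const server = app.listen(Number(process.env.PORT ?? 8080), "0.0.0.0");

process.on("SIGTERM", () => {
  console.log("SIGTERM received: draining connections");
  // Safety net: force the exit before Cloud Run sends SIGKILL at 10 seconds.
  setTimeout(() => process.exit(1), 9000).unref();
  server.close(async () => {
    await pool.end();          // Close the DB pool
    process.exit(0);           // Exit within 10 seconds
  });
});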
```

What really matters here is **idempotency.** Processing that was partway through when SIGTERM arrived can be **retried and executed again on another instance.** I uphold on Cloud Run the same principle I applied thoroughly in the [payment foundation with 0 double charges in production](/case-studies/payment-platform-reliability): **make duplicate execution structurally impossible with an idempotency key + unique constraint, so the same operation produces the same result no matter how many times it arrives.** Closing resources carefully in the SIGTERM handler is not enough on its own; the system is safe only once **the processing side is idempotent.** For the design of idempotent async processing, [SQS/Lambda idempotent processing](/blog/aws-sqs-lambda-eventbridge-idempotent-async-processing-guide) and [Transactional Outbox](/blog/transactional-outbox-pattern-reliable-event-publishing-guide) are also useful (the cloud differs but the principle is the same).
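
The "idempotency key + unique constraint" idea fits in a few lines. This is a sketch under stated assumptions: `sqlite3` stands in for Cloud SQL so the example runs anywhere, and the table, column, and function names are hypothetical rather than taken from the payment system.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database whose primary key enforces one row per key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE charges ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount INTEGER NOT NULL)"
    )
    return conn


def charge_once(conn: sqlite3.Connection, key: str, amount: int) -> bool:
    """Record the charge; return False if this key was already processed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO charges (idempotency_key, amount) VALUES (?, ?)",
                (key, amount),
            )
        return True
    except sqlite3.IntegrityError:
        # The database rejected the duplicate, so a retry cannot double-charge.
        return False


if __name__ == "__main__":
    db = open_db()
    print(charge_once(db, "order-42", 1200))  # first delivery
    print(charge_once(db, "order-42", 1200))  # retry after an instance was killed
```

The duplicate is rejected by the database rather than by application logic, so the guarantee holds even when two instances process the same retry concurrently.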

---

## Revisions and Deploy: Blue/Green, Canary, Instant Rollback

Cloud Run's deploy strategy is built on **revisions.** A revision is **an immutable snapshot of code and config**, and you can split traffic across revisions **by percentage.**

### Deploy Without Routing Traffic → Verify with a Tagged URL

```bash
# Deploy a new revision without routing traffic to it. Only a tagged URL is issued.
gcloud run deploy api \
  --image IMAGE_URL --region asia-northeast1 \
  --no-traffic --tag green
# -> You can verify it at https://green---api-xxxxx.a.run.app, isolated from production traffic
```

### Canary → Blue/Green (Staged Switch)

```bash
# Send only 5% to the new revision (LATEST) as a canary. The current revision handles the remaining 95%.
gcloud run services update-traffic api --region asia-northeast1 \
  --to-revisions LATEST=5

# If the metrics are healthy, raise the share in stages and finish at 100% (the Blue/Green switch)
gcloud run services update-traffic api --region asia-northeast1 --to-latest
```
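
The staged switch can be scripted. The sketch below only builds the `gcloud run services update-traffic` command strings for a list of percentages; the function name and the step values are assumptions, and a real pipeline would check metrics between steps before running the next command.

```python
def canary_commands(service: str, region: str, revision: str, steps: list[int]) -> list[str]:
    """Build one update-traffic command per canary step, in increasing order."""
    commands = []
    previous = 0
    for percent in steps:
        if not previous < percent <= 100:
            raise ValueError(f"steps must increase and stay within 1-100: {steps}")
        commands.append(
            f"gcloud run services update-traffic {service} "
            f"--region {region} --to-revisions {revision}={percent}"
        )
        previous = percent
    return commands


if __name__ == "__main__":
    # Hypothetical rollout: 5% -> 25% -> 100%.
    for cmd in canary_commands("api", "asia-northeast1", "LATEST", [5, 25, 100]):
        print(cmd)
```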

### Instant Rollback

```bash
# If a problem appears, just send 100% back to a healthy older revision. No rebuild needed.
gcloud run services update-traffic api --region asia-northeast1 \
  --to-revisions api-00021-abc=100
```

Cloud Run's strength is that **a rollback is just "send 100% of traffic back to an old revision."** No re-deploy and no image rebuild. Note one official caution: **the switch isn't instantaneous, because in-progress requests complete on the revision that received them.** If session affinity is enabled, be aware that affinity also influences where returning users are routed.

---

## Security: Least-Privilege Service Accounts and Secret Management

### Assign a Dedicated Service Account Per Service

This is **the first thing to set on Cloud Run.** If you specify nothing, the service runs as the **default Compute Engine service account**, which in many projects holds the **overly broad Editor role.**

> We strongly recommend that you disable the automatic role grant by enforcing the `iam.automaticIamGrantsForDefaultServiceAccounts` organization policy constraint. ([Service identity](https://docs.cloud.google.com/run/docs/securing/service-identity))

The right answer is to **create a least-privilege, user-managed service account per service and assign it with `--service-account`.**

```bash
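# Create a service account dedicated to this service and grant only the permissions it needs (least privilege)
gcloud iam service-accounts create api-runtime --display-name "api runtime"

# Example: give this SA only the Secret Manager accessor role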
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member "serviceAccount:api-runtime@PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/secretmanager.secretAccessor"

gcloud run deploy api --region asia-northeast1 \
  --service-account api-runtime@PROJECT_ID.iam.gserviceaccount.com ...
```

In my broadcaster platform, I **assigned a dedicated SA to each service** and ran Cloud SQL with **IAM auth, mandatory TLS (ENCRYPTED_ONLY), and private IP**, removing credentials from both code and network as far as possible.

### Secrets from Secret Manager: Environment Variables vs Volume

Don't put secrets (API keys, DB passwords) in the image or in plaintext environment variables; inject them from **Secret Manager** ([Configure secrets](https://docs.cloud.google.com/run/docs/configuring/services/secrets)). There are 2 injection methods, and **they behave differently.**

| Method | Value resolution | Best suited for |
|------|---------|--------------|
| **Environment variable** | **Fixed at instance startup.** Doesn't change while the instance runs | A secret whose version you want to pin (**specify a concrete version, not `latest`**) |
| **Volume mount** | **Always fetches the latest version** (as a file) | **A rotating secret** (the next read picks up the new value) |

```bash
# Inject as an environment variable (pinned version). The SA needs roles/secretmanager.secretAccessor.
gcloud run deploy api --region asia-northeast1 \
  --set-secrets "DB_PASSWORD=db-password:3"

# Mount as a volume (always latest, suited to rotation)
gcloud run deploy api --region asia-northeast1 \
  --set-secrets "/etc/secrets/db/password=db-password:latest"
```
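
On the application side, the difference between the two methods shows up in when the value is read. A minimal sketch, assuming the variable name and mount path from the commands above; the helper name is hypothetical.

```python
import os
from pathlib import Path

# Environment variable: resolved once when the instance starts, then fixed.
DB_PASSWORD_AT_STARTUP = os.environ.get("DB_PASSWORD", "")

# Volume mount: the path used in the --set-secrets example above.
SECRET_PATH = Path("/etc/secrets/db/password")


def read_mounted_secret(path: Path = SECRET_PATH) -> str:
    """Re-read the file on every call so a rotated value is picked up."""
    return path.read_text(encoding="utf-8").strip()
```

Caching the file's contents in a module-level variable would throw away the benefit of the volume mount, so read it at the point of use (for example, when opening a new DB connection).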

### Harden the Entrance: Auth + Cloud Armor

- **Require authentication for inter-service calls and in-house tools** (`--no-allow-unauthenticated`). Grant the caller's SA `roles/run.invoker` and call with an ID token.
- For a public endpoint, put an **external HTTP(S) load balancer + Cloud Armor** in front and apply a WAF (OWASP rules), rate limiting, and adaptive DDoS protection. In my platform, I placed **Cloud Armor (OWASP CRS 3.3 + adaptive DDoS + rate limiting)** at the entrance and **fully enabled the WAF in stg first to eliminate false positives before production.** For the philosophy of defense in depth, see also the [WAF defense-in-depth guide](/blog/waf-defense-in-depth-aws-waf-cloud-armor-owasp-guide).
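
For the service-to-service call, here is a sketch that uses only the standard library: inside Cloud Run, the metadata server issues an ID token for the runtime service account when you pass the callee's URL as the audience. The helper names and the callee URL are hypothetical; Google's auth client libraries wrap the same flow, and this version skips caching and retries.

```python
import urllib.parse
import urllib.request

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def identity_token_request(audience: str) -> urllib.request.Request:
    """Build the metadata-server request for an ID token scoped to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def fetch_identity_token(audience: str, timeout: float = 5.0) -> str:
    """Fetch the token. This only works where the metadata server is reachable."""
    with urllib.request.urlopen(identity_token_request(audience), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def authorized_request(url: str, audience: str) -> urllib.request.Request:
    """Attach the ID token as a Bearer token for the receiving service."""
    token = fetch_identity_token(audience)
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    callee = "https://callee-xxxxx.a.run.app"  # hypothetical receiving service
    with urllib.request.urlopen(authorized_request(callee, audience=callee)) as resp:
        print(resp.status)
```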

---

## Networking: Make Direct VPC egress the Default

To reach **a resource inside the VPC (Cloud SQL's private IP, Memorystore, an internal API)** from Cloud Run, there are 2 methods. The official docs recommend the newer one ([Networking best practices](https://docs.cloud.google.com/run/docs/configuring/networking-best-practices)).

| Method | Characteristics |
|------|------|
| **Direct VPC egress (recommended, GA)** | **No connector VM.** No idle billing, low latency, high throughput. Needs IP space in the subnet |
| **Serverless VPC Access connector** | The older method. You carry the connector VM's always-on cost and operations |

```bash
# Direct VPC egress: go straight to the VPC without a connector
gcloud run deploy api --region asia-northeast1 \
  --network projects/PROJECT_ID/global/networks/my-vpc \
  --subnet projects/PROJECT_ID/regions/asia-northeast1/subnetworks/run-subnet \
  --vpc-egress private-ranges-only
```

For new services, choose Direct VPC egress without hesitation. It is also cheaper, because the connector's always-on cost and idle billing disappear. The practical details of ingress control, IAM auth, Cloud SQL private-IP connection, and Cloud Armor defense in depth are in the [networking and security guide](/blog/google-cloud-run-networking-security-vpc-egress-cloud-armor-iam-ingress-guide).

---

## Jobs and Worker Pools: Where to Place Processing Unsuited to HTTP

Once you accept that Services handle synchronous HTTP, the production design falls into place. Task splitting, idempotency, resumable design, and orchestration with Cloud Workflows are covered systematically in the dedicated article [Cloud Run Jobs and Cloud Workflows guide](/blog/google-cloud-run-jobs-workflows-batch-async-idempotent-guide).

- **Cloud Run Jobs**: processing that **runs, finishes, and stops.** DB migrations, periodic batches, long-running bulk processing. Split the work with `--tasks`/`--parallelism` and retry with `--max-retries`. Start on a schedule with Cloud Scheduler, or on an event with Eventarc (in the example below, the trigger targets a receiving Service).
- **Worker Pools**: **resident background processing.** A Pub/Sub pull subscriber, a Kafka consumer, and similar workloads that keep running without receiving HTTP requests.

```bash
# Example: start the malware scan of uploaded material from Eventarc (a GCS event)
gcloud eventarc triggers create scan-on-upload \
  --location asia-northeast1 \
  --destination-run-service malware-scanner \
  --event-filters "type=google.cloud.storage.object.v1.finalized" \
  --event-filters "bucket=uploads-raw" \
  --service-account eventarc-invoker@PROJECT_ID.iam.gserviceaccount.com
```

> My platform's malware scanner received GCS uploads via Eventarc, passed them to ClamAV (Cloud Run), and **stream-scanned material of up to 10 GiB without buffering**, sorting files into clean/quarantine buckets. The atomicity of `File.move` made it idempotent against retries, and it safely ignored zero-length, still-uploading, and already-deleted objects. **"Decouple heavy processing from HTTP" and "make it idempotent"**: these 2 points are the backbone of running Cloud Run in production.

---

## IaC: Build It Declaratively with Terraform

Build production not with manual `gcloud` commands but **declaratively with Terraform (`google_cloud_run_v2_service`).** The settings covered so far (concurrency, scale, timeout, probes, SA, billing mode, execution environment) come together in one place.

```hcl
resource "google_cloud_run_v2_service" "api" {
  name     = "api"
  location = "asia-northeast1"
  ingress  = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER" # Restrict the entrance to traffic coming through the LB + Cloud Armor

  template {
    service_account                  = google_service_account.api_runtime.email
    max_instance_request_concurrency = 80
    timeout                          = "300s"
    execution_environment            = "EXECUTION_ENVIRONMENT_GEN2"

    scaling {
      min_instance_count = 1   # Keep the production entrance warm
      max_instance_count = 10  # Cost safety valve
    }

    containers {
      image = "asia-northeast1-docker.pkg.dev/${var.project_id}/repo/api:${var.image_tag}"
      ports { container_port = 8080 }

      resources {
        limits            = { cpu = "1", memory = "512Mi" }
        cpu_idle          = true   # true = request-based billing (CPU stopped when idle) / false = instance-based billing
        startup_cpu_boost = true   # Shorten cold starts
      }

      startup_probe {
        tcp_socket { port = 8080 }
        failure_threshold = 10     # Leave headroom for slow-starting apps
        period_seconds    = 5
        timeout_seconds   = 3
      }
      liveness_probe {
        http_get { path = "/healthz" }
        period_seconds = 10
      }

      # Inject secrets from Secret Manager (pinned version)
      env {
        name = "DB_PASSWORD"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.db_password.secret_id
            version = "3"
          }
        }
      }
    }

    # Use Direct VPC egress for resources inside the VPC (Cloud SQL private IP, etc.)
    vpc_access {
      network_interfaces {
        subnetwork = google_compute_subnetwork.run.id
      }
      egress = "PRIVATE_RANGES_ONLY"
    }
  }

  # 100% to the latest revision (for a canary, split this into multiple traffic blocks)
  traffic {
    type    = "TRAFFIC_TARGET_ALLOCATION_TYPE_LATEST"
    percent = 100
  }
}
```

`cpu_idle = true` corresponds to **request-based billing (CPU stopped when idle)**, and `false` to **instance-based billing (CPU always allocated).** This choice has a large effect on cost (details in the [billing article](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide)).

Making CI/CD keyless (Workload Identity Federation) is covered in detail in a separate article: [Make GitHub Actions keyless](/blog/github-actions-oidc-keyless-cicd-aws-gcp-guide). In my project, I **codified all of GCP in about 71 Terraform modules and kept stg/prod state separate.**

---

## Observability: Just Emit Structured Logs to Standard Output

Cloud Run **ingests standard output / standard error as-is into Cloud Logging.** If the app emits **structured JSON logs to stdout/stderr rather than to files**, logs are collected without an additional agent.

```python
import json, logging, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Cloud Logging interprets severity / trace. Include trace for correlation.
        return json.dumps({
            "severity": record.levelname,
            "message": record.getMessage(),
            "logging.googleapis.com/trace": getattr(record, "trace", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
```

- **Metrics** (request count, latency, instance count, CPU/memory utilization) are exported to Cloud Monitoring automatically. Attach SLOs and alerts there.
- **Traces** are instrumented with OpenTelemetry and sent to Cloud Trace / Cloud Monitoring. My malware scanner also **sent its scan results to Cloud Monitoring with OpenTelemetry.** For the philosophy of observability, see the [OpenTelemetry practical guide](/blog/opentelemetry-observability-production-tracing-metrics-logs).
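
The `trace` attribute that the formatter above reads has to be set per request. Cloud Run forwards an `X-Cloud-Trace-Context` header in the form `TRACE_ID/SPAN_ID;o=OPTIONS`, and Cloud Logging expects the value `projects/PROJECT_ID/traces/TRACE_ID`. A minimal sketch of that conversion, where the function name and the project ID are assumptions:

```python
import re
from typing import Optional

# TRACE_ID is 32 hex characters; the span ID and options are optional.
_TRACE_HEADER = re.compile(r"^([0-9a-fA-F]{32})(?:/(\d+))?(?:;o=(\d))?$")


def trace_field(header_value: Optional[str], project_id: str) -> Optional[str]:
    """Convert an X-Cloud-Trace-Context header into the Cloud Logging trace value."""
    if not header_value:
        return None
    match = _TRACE_HEADER.match(header_value.strip())
    if match is None:
        return None  # malformed header: log without correlation rather than fail
    return f"projects/{project_id}/traces/{match.group(1)}"


if __name__ == "__main__":
    header = "105445aa7843bc8bf206b12000100000/1;o=1"
    print(trace_field(header, "my-project"))
```

In a request handler you would pass the result through `extra`, for example `logging.info("scan finished", extra={"trace": trace_field(header, project_id)})`, so the formatter finds it on the record.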

---

## Pre-Production Checklist

- [ ] Listens on `$PORT` at `0.0.0.0` (the container contract)
- [ ] The **SIGTERM handler** cleans up within 10 seconds, and the **processing is idempotent**
- [ ] Holds no state on the instance (state lives in an external store)
- [ ] **Concurrency** matches the load characteristics (default 80; don't set 1 without a reason)
- [ ] **Min/max instances** are set (keep the entrance warm, cap the cost)
- [ ] The **request timeout** has been reviewed; anything longer goes to **Jobs/Workflows**
- [ ] **Startup/liveness probes** are set (widen the threshold if startup is slow)
- [ ] A **dedicated least-privilege service account** is assigned; the default SA is not used
- [ ] Secrets are in **Secret Manager** (env var + pinned version to fix the value, volume for rotation)
- [ ] Only what truly needs it is public (default is **`--no-allow-unauthenticated`**); the public face sits behind **Cloud Armor**
- [ ] VPC connectivity uses **Direct VPC egress**
- [ ] Managed declaratively with **Terraform**; CI/CD is **keyless with Workload Identity Federation**
- [ ] Logs are **structured JSON on stdout/stderr**, observable together with metrics/traces

---

## Summary: The Crux of Serverless Containers Is Portable

Cloud Run is a serverless platform for concentrating on running containers. The key to production quality is not special magic but **upholding the contract**: listen on `$PORT`, shut down cleanly on SIGTERM, hold no state, move heavy processing to jobs, control cost with concurrency and scaling, and harden with least privilege and secret management.

These are **the common core of serverless containers**, and they hold equally on AWS Fargate or Azure Container Apps. I have run a broadcaster platform in production on GCP with Cloud Run, and a payment foundation and a lumber-distribution DX system on AWS with Fargate. Even when the cloud changes, **the design principles for running containers in production so that they are resilient, cheap, and safe stay the same.**

If you're torn on technology selection, continue to the [GCP container technology-selection guide](/blog/google-cloud-run-vs-gke-app-engine-cloud-run-functions-compute-selection-guide); if you want to refine cost, continue to the [concurrency, auto-scale, billing guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide).