友田 陽大
Google Cloud Run in production
GCP
Cloud Run
Troubleshooting
Observability
Serverless
Performance
Reliability
Infrastructure

Cloud Run troubleshooting compendium: causes and fixes for start failures, 503/504, OOM (exit 137), cold starts, and deploy failures

A practical guide to fixing common production Cloud Run errors by cause, keyed to the exact official messages. It covers 'Container failed to start and listen on the port defined by the PORT environment variable,' exit 137 (OOM) from memory overrun, 503 'no available instance,' 504 request timeout, image-pull permission errors, and slow cold starts, with diagnosis steps and gcloud/code fixes for each.

Published
Reading time
7 min read
Author
友田 陽大

Most Cloud Run errors point to a narrow set of causes once you read the message correctly. Tweaking settings on a hunch, by contrast, drags you into a quagmire. This article organizes common production errors by the exact message in the official documentation and shows the shortest path to the cause and the fix. The goal is that someone who searched for "cloud run container failed to start" can solve the problem on this page and move on.

To avoid these errors at the design stage, see the Cloud Run production-operations guide; for cold-start and cost optimization, see the concurrency & billing guide.


First, diagnosis: a two-pronged approach of logs and local reproduction

Before getting into individual errors, decide on the two things you do first for any problem.

# 1. Read the revision's logs (the primary source of information about the error)
gcloud run services logs read api --region asia-northeast1 --limit 100

# 2. Reproduce production conditions locally (does it start with PORT injected?)
docker run --rm -e PORT=8080 -p 8080:8080 \
  asia-northeast1-docker.pkg.dev/PROJECT_ID/app/api:TAG
# → If http://localhost:8080 is unreachable, it won't start in production either

Checking whether the image starts locally with -e PORT=8080 reproduces, and usually resolves, most startup errors on your own machine. Cloud Logging shows the app's stdout/stderr as-is, so emitting structured logs speeds up cause-tracing (observability design).
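
As a minimal sketch of what "structured logs" can look like, the helper below writes one JSON object per line to stdout. It relies on Cloud Logging treating the `severity` and `message` keys of a JSON line specially; the helper name `log` and the extra fields (`route`, `attempt`) are illustrative assumptions, not part of any Cloud Run API.

```python
import json
import sys


def log(severity: str, message: str, **fields) -> str:
    """Write one JSON log line to stdout and return the line that was written."""
    entry = {
        "severity": severity,  # e.g. INFO, WARNING, ERROR
        "message": message,
        **fields,              # hypothetical extra context, kept as JSON fields
    }
    line = json.dumps(entry, ensure_ascii=False)
    # flush=True so the line is not lost in a buffer if the instance is stopped
    print(line, file=sys.stdout, flush=True)
    return line


if __name__ == "__main__":
    log("ERROR", "upstream call failed", route="/scan", attempt=2)
```

Keeping each entry on a single line matters here: a multi-line stack trace printed as plain text ends up as several unrelated log entries.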


Error 1: Container failed to start (most frequent)

Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable.

The most frequent error, appearing right after deploy. Cloud Run started the container but found nothing listening on $PORT.

Causes and fixes:

  1. Not listening on 0.0.0.0:$PORT (most common). The server is bound to localhost/127.0.0.1, or the port is hardcoded.
# ✗ Bad: 127.0.0.1 / fixed port → unreachable from Cloud Run, so startup fails
uvicorn.run(app, host="127.0.0.1", port=3000)

# ✓ Good: bind to 0.0.0.0 and read the PORT environment variable
import os
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
  2. Not built for 64-bit Linux. This tends to happen when an image built locally on Apple Silicon (arm64) is deployed as-is. Build multi-arch, or specify --platform linux/amd64.
docker build --platform linux/amd64 -t IMAGE_URL .
  3. The startup processing itself fails (for example, waiting on a connection to a dependency and timing out, or an exception caused by missing config). Don't run heavy synchronous initialization or blocking external connections at startup. Check the log for an exception. If startup is genuinely slow, widen the startup probe's failure_threshold × period_seconds.
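
To make the failure_threshold × period_seconds budget concrete, here is a rough arithmetic sketch. The function names are made up for illustration, and adding `initial_delay_seconds` to the budget is an assumption about how the probe window adds up; treat the result as an estimate to compare against your measured startup time, not an exact guarantee.

```python
def startup_budget_seconds(
    period_seconds: int,
    failure_threshold: int,
    initial_delay_seconds: int = 0,  # assumption: the delay before the first probe adds to the budget
) -> int:
    """Approximate time the container has to start listening before the probe gives up."""
    return initial_delay_seconds + period_seconds * failure_threshold


def probe_covers_startup(measured_startup_seconds: float, budget_seconds: int) -> bool:
    """True if the measured startup time fits inside the probe budget."""
    return measured_startup_seconds <= budget_seconds


if __name__ == "__main__":
    budget = startup_budget_seconds(period_seconds=10, failure_threshold=6)
    print(budget, probe_covers_startup(45.0, budget))
```

With a 10-second period and a threshold of 6, the budget is about 60 seconds, so a container that needs 45 seconds to listen fits, while one that needs 90 seconds does not.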

Error 2: memory overrun / exit 137 (OOM)

... the container instance was found to be using too much memory and was terminated. (the container's exit code is 137)

OOM-Killed for exceeding the allocated memory. Note that Cloud Run's writable file system is in-memory, so what you write eats memory.

Causes and fixes:

  1. Insufficient memory allocation → increase it to match the peak (but blindly increasing raises cost; judge by metrics).
gcloud run services update api --region asia-northeast1 --memory 1Gi
  2. Writes to the in-memory FS (large temp files, logging to local files) → keep temporary data small, or stream to Cloud Storage. Send logs to stdout, not files.

  3. Memory leak (accumulation of connections/buffers) → fix the leak. Check whether you hold large objects between requests.

My malware scanner handled material up to 10GiB, so it scanned by streaming without buffering to avoid memory exhaustion. A design that "loads everything into memory before processing" is especially dangerous on serverless. Process large data by streaming.
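
The difference between "load everything" and "stream" can be sketched with a chunked read loop. This is a generic illustration, not the scanner's actual code: it hashes a stream with SHA-256 as a stand-in for any per-chunk processing, and the 1 MiB chunk size is an arbitrary assumption.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def sha256_stream(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a stream chunk by chunk, so only one chunk is held in memory at a time."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 5))
    print(sha256_stream(data))
```

Peak memory stays near the chunk size regardless of the input size, which is what keeps a 10GiB input from pushing the instance over its memory limit.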


Error 3: 503 Service Unavailable

The request was aborted because there was no available instance.

No instance was available to handle the request. Spikes, cold starts, scale limits, and probe failures are involved.

Causes and fixes:

  1. Max instances too low, lost to a spike → raise --max-instances.
  2. Cold start too slow, so new instances aren't ready in time → keep instances warm with --min-instances 1 or more, and apply cold-start countermeasures (startup CPU boost, slim image, gen1).
  3. Concurrency too low, swelling the needed instances → revisit concurrency (raise it if I/O-bound).
  4. The liveness probe too strict, instances repeatedly restart → relax the probe timeout/threshold, and don't put a heavy dependency check in the probe.
gcloud run services update api --region asia-northeast1 \
  --min-instances 1 --max-instances 20 --cpu-boost
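
To see how concurrency and max instances interact, a back-of-the-envelope estimate helps. The sketch below assumes steady traffic and uses the relation "requests in flight ≈ requests per second × average latency"; the function name and the sample numbers are illustrative, and real autoscaling also reacts to CPU utilization, so treat the result as a lower bound for --max-instances rather than a prediction.

```python
import math


def instances_needed(
    requests_per_second: float,
    avg_latency_seconds: float,
    concurrency: int,
) -> int:
    """Estimate how many instances are needed to hold the in-flight requests."""
    in_flight = requests_per_second * avg_latency_seconds
    return max(1, math.ceil(in_flight / concurrency))


if __name__ == "__main__":
    # Same traffic, different concurrency settings
    print(instances_needed(200, 0.5, concurrency=80))
    print(instances_needed(200, 0.5, concurrency=10))
```

At 200 requests per second and 0.5 seconds of latency, about 100 requests are in flight: concurrency 80 needs 2 instances, while concurrency 10 needs 10, which is how a low concurrency setting runs into the max-instances cap during a spike.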

Error 4: 504 Gateway Timeout

The request has been terminated because it has reached the maximum request timeout.

The response didn't finish within the request timeout (default 300s, max 60 min).

Causes and fixes:

  1. The processing is simply long → extend the timeout (--timeout). But 60 minutes is the cap.
gcloud run services update api --region asia-northeast1 --timeout 600
  2. Long-running processing that shouldn't be held in an HTTP request in the first place → split it out to Cloud Run Jobs / Workflows (Jobs/Workflows guide). Endlessly extending the timeout just defers the design problem.
  3. Reuse of a dead connection (reusing a DB/HTTP connection from a broken pool) → add connection-health checks and retries.
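
For the dead-connection case, one possible shape of "health checks and retries" is a small wrapper that reconnects before retrying. This is a sketch under assumptions: `operation` and `reconnect` are hypothetical callables you supply, and `ConnectionError` stands in for whatever exception your DB or HTTP client raises on a stale connection.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 3,
    base_delay: float = 0.2,
) -> T:
    """Run operation; on a retriable error, reconnect, back off, and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            reconnect()  # drop the stale connection and open a fresh one
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")
```

Keep the total retry time well below the request timeout; otherwise the retries themselves become the reason for the 504. Many connection pools also offer a built-in liveness check on checkout, which is usually preferable when available.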

Error 5: image-pull / permission error (deploy failure)

... the Google Cloud Run Service Agent must have permission to read the image ...

The image can't be read at deploy time, which means a permission is missing. This is not a code problem.

Causes and fixes:

  • Give the Cloud Run service agent (service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com) Artifact Registry read permission for the registry holding the image.
gcloud artifacts repositories add-iam-policy-binding app \
  --location asia-northeast1 \
  --member "serviceAccount:service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com" \
  --role "roles/artifactregistry.reader"
  • The image is in another project → grant permission on that project's side.
  • Check whether VPC Service Controls is blocking the read.

Relatedly, for cases where the deploy itself fails due to insufficient permission of the deploy SA (CI), see the CI/CD guide (run.developer + AR read + serviceAccountUser on the runtime SA).


Error 6: container import failure

The service has encountered an error during container import. Resource readiness deadline exceeded.

Failure at the image-import stage. Relatively rare but hard to pin down.

Causes and fixes:

  • A non-UTF-8 filename in the image → rebuild in UTF-8.
  • Unsupported layers from a Windows-based image (foreign layers) → enable --allow-nondistributable-artifacts on the Docker daemon, then push the image again.
  • The baseline is to ship a straightforward 64-bit Linux image (multi-stage, distroless, etc.).
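
Finding the offending filename by eye is tedious, so a small scan can help. The sketch below walks a directory tree as bytes and reports names that do not decode as UTF-8; it assumes you run it against the build context or an unpacked copy of the image's file system, and the function names are illustrative.

```python
import os
from typing import Iterator


def is_utf8(name: bytes) -> bool:
    """True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_paths(root: bytes) -> Iterator[bytes]:
    """Yield every path under root whose file or directory name is not valid UTF-8."""
    # Passing bytes to os.walk makes it yield raw bytes names, not decoded strings
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_utf8(name):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in find_non_utf8_paths(os.fsencode(".")):
        print(path)
```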

Error 7: slow cold start (not an error, but it hurts perceived latency)

There is no explicit error, yet only the first request is oddly slow: that is a cold start after scale-to-zero.

Diagnosis and fixes (lightest first):

  1. Make the image small (multi-stage, distroless/alpine). Pull and unpack get faster.
  2. Startup CPU boost (--cpu-boost) to speed up initialization.
  3. Lazy initialization: don't load everything at startup; initialize on demand. Create connections once outside the handler and reuse them.
  4. Execution environment gen1 (faster cold start) for lightweight APIs.
  5. Only for paths where latency is still unacceptable, keep an instance warm with --min-instances 1 (a trade-off against always-on billing).
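
The lazy-initialization item above can be sketched as follows. `ExpensiveClient` is a hypothetical stand-in for a DB pool or API client that is slow to construct; the point is only that nothing is built at import time and that the same object is reused across requests.

```python
from functools import lru_cache


class ExpensiveClient:
    """Hypothetical stand-in for a client that is slow to create."""

    instances_created = 0  # counter so the reuse is observable

    def __init__(self) -> None:
        ExpensiveClient.instances_created += 1

    def query(self, key: str) -> str:
        return f"value-for-{key}"


@lru_cache(maxsize=1)
def get_client() -> ExpensiveClient:
    """Create the client on first use, then return the cached instance."""
    return ExpensiveClient()


def handle_request(key: str) -> str:
    # The handler asks for the client; construction cost is paid once per instance
    return get_client().query(key)


if __name__ == "__main__":
    print(handle_request("a"), handle_request("b"), ExpensiveClient.instances_created)
```

The trade-off is that the first request to need the client pays the construction cost. With concurrent first requests, the factory may also run more than once before the cache is filled, so use an explicit lock if the client must be a strict singleton.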

Detailed cost estimates and tuning steps are summarized in the concurrency, autoscale, and cost-optimization guide.


Diagnosis quick reference

| Symptom / message | Primary cause | What to try first |
| --- | --- | --- |
| failed to start and then listen on the port | Can't listen on 0.0.0.0/$PORT, arm64, startup error | Reproduce locally with -e PORT=8080, --platform linux/amd64 |
| exit 137 / too much memory | OOM (insufficient, leak, in-memory FS) | Increase --memory, stream, logs to stdout |
| 503 no available instance | Cold start, scale limit, probe | --min-instances, --max-instances, --cpu-boost, relax probe |
| 504 maximum request timeout | Processing exceeded, dead connection | --timeout, long-running to Jobs, validate connections |
| Service Agent must have permission to read the image | Insufficient AR read permission | Give the service agent artifactregistry.reader |
| Resource readiness deadline exceeded | Non-UTF-8 / foreign layers | Rebuild in UTF-8 / 64-bit Linux |

Production-launch checklist (prevention)

  • docker run -e PORT=8080 starts locally
  • The image is 64-bit Linux (--platform linux/amd64 or multi-arch)
  • No heavy synchronous processing / blocking wait on external connections at startup
  • Stream large data without loading into memory (don't exhaust the in-memory FS)
  • Set min/max instances and probes to prevent 503
  • Long-running processing to Jobs/Workflows (don't paper over 504 by extending the timeout)
  • Speed up cause-tracing with structured logs (stdout)
  • Confirm the service agent's / deploy SA's permissions in advance

Summary: read the message and the cause is narrowed

Cloud Run troubleshooting is 90% reading the official error message accurately. "Won't start" → $PORT, 0.0.0.0, architecture; "137" → memory; "503" → scale and cold start; "504" → splitting off long-running processing; "image permission" → IAM on the service agent. The messages are designed so that you can trace each one back to a narrow set of causes. Before rushing to tweak settings, read the log and reproduce locally. That's the shortest path.

Prevention starts from design: see the whole picture in the Cloud Run production-operations guide, cost and cold start in the concurrency & billing guide, and long-running processing in the Jobs/Workflows guide. If you need a production incident investigated and a permanent fix designed, I can help, drawing on hands-on operations experience.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.

