Cloud Run troubleshooting compendium: causes and fixes for start failures, 503/504, OOM (exit 137), cold starts, and deploy failures

Once you read a Cloud Run error message correctly, the cause narrows to almost a single candidate. Tweaking settings on a hunch, by contrast, drags you into a quagmire. This article organizes common production errors by the exact message given in the official documentation and shows the shortest path from cause to fix. The goal is that someone who googled "cloud run container failed to start" can solve the problem on this page and leave.

To avoid hitting these errors in the first place at the design stage, see the Cloud Run production-operations guide; for cold-start and cost optimization, see the concurrency & billing guide.

First, diagnosis: a two-pronged approach of logs and local reproduction

Before getting into individual errors, settle on the two things you do first for any problem.

# 1. Read the revision's logs (the primary information about the error is here)
gcloud run services logs read api --region asia-northeast1 --limit 100

# 2. Reproduce production conditions locally (does it start when PORT is injected?)
docker run --rm -e PORT=8080 -p 8080:8080 \
  asia-northeast1-docker.pkg.dev/PROJECT_ID/app/api:TAG
# → If you can't reach http://localhost:8080, it won't start in production either

"Does it start locally with -e PORT=8080?" This check alone reproduces most startup errors on your own machine and usually resolves them. Cloud Logging shows the app's stdout/stderr as-is, so emitting structured logs speeds up cause-tracing (see observability design).

Error 1: Container failed to start (most frequent)

Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable.

This is the most frequent error, and it appears right after a deploy. Cloud Run started the container but concluded that nothing is listening on $PORT.

Causes and fixes:

Not listening on 0.0.0.0:$PORT (most common). The app is bound to localhost/127.0.0.1, or the port is hardcoded.

# ✗ Bad: 127.0.0.1 with a hardcoded port → Cloud Run can't reach it, so startup fails
uvicorn.run(app, host="127.0.0.1", port=3000)

# ✓ Good: bind to 0.0.0.0 and read the PORT environment variable
import os
uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))

The image isn't built for 64-bit Linux. This tends to happen when you deploy an image built locally on Apple Silicon (arm64) as-is. Build a multi-arch image, or specify --platform linux/amd64.

docker build --platform linux/amd64 -t IMAGE_URL .

Startup itself fails (for example, waiting on a dependency's connection until it times out, or an exception caused by missing configuration). Don't run heavy synchronous initialization or blocking external connections at startup. Check the log for an exception. If startup is genuinely slow, widen the startup probe's window (failure_threshold × period_seconds).
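The probe window mentioned above is simple arithmetic, so it can help to compute it from a measured startup time rather than guess. The following is a rough sketch: the function names are hypothetical, and adding the initial delay to the window is an assumption, so check the exact probe timing semantics against the Cloud Run documentation before relying on the numbers.

```python
import math


def startup_window_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time the startup probe allows before the start is judged failed."""
    # Assumption: the initial delay simply adds to failure_threshold x period_seconds.
    return initial_delay_seconds + period_seconds * failure_threshold


def min_failure_threshold(startup_seconds: float, period_seconds: int,
                          initial_delay_seconds: int = 0) -> int:
    """Smallest failure_threshold whose window covers the measured startup time."""
    remaining = max(startup_seconds - initial_delay_seconds, 0)
    return max(1, math.ceil(remaining / period_seconds))
```

For example, with a 10-second period, a threshold of 3 gives a 30-second window; a container measured at 45 seconds to start would need a threshold of at least 5 under these assumptions.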

Error 2: memory overrun / exit 137 (OOM)

... the container instance was found to be using too much memory and was terminated. (the container's exit code is 137)

The container was OOM-killed for exceeding its allocated memory. Note that Cloud Run's writable file system is in-memory, so anything you write to it consumes memory.

Causes and fixes:

Insufficient memory allocation → increase it to match the peak (but blindly increasing raises cost; judge by metrics).

gcloud run services update api --region asia-northeast1 --memory 1Gi

Writes to the in-memory FS (large temp files, logging to local files) → keep temporary data small, or stream it to Cloud Storage. Send logs to stdout, not to files.
Memory leak (accumulating connections/buffers) → fix the leak. Check whether you hold large objects between requests.

My malware scanner handled inputs of up to 10 GiB, so it scanned them as a stream, without buffering, to avoid memory exhaustion. A design that loads everything into memory before processing is especially dangerous on serverless. Process large data by streaming.
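To make the streaming idea concrete, here is a minimal sketch that processes a stream in fixed-size chunks so memory use stays bounded by the chunk size regardless of input size. Hashing stands in for the real per-chunk work; this is not the scanner described above, and the function name and the 1 MiB chunk size are assumptions for illustration.

```python
import hashlib
from typing import BinaryIO, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an assumed value, tune it to your workload


def sha256_streaming(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Tuple[str, int]:
    """Hash a stream chunk by chunk; returns (hex digest, total bytes read)."""
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)  # at most chunk_size bytes held at once
        if not chunk:
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total
```

Any file-like object opened in binary mode works here, including an HTTP response body or a Cloud Storage download stream, as long as it supports read(n).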

Error 3: 503 Service Unavailable

The request was aborted because there was no available instance.

No instance was available to handle the request. The usual factors are traffic spikes, cold starts, scale limits, and probe failures.

Causes and fixes:

Max instances too low to absorb a spike → raise --max-instances.
Cold start too slow for new instances to come up in time → keep instances warm with --min-instances 1 or more, and apply cold-start countermeasures (startup CPU boost, slim image, gen1).
Concurrency too low, inflating the number of instances needed → revisit concurrency (raise it if the workload is I/O-bound).
Liveness probe too strict, so instances restart repeatedly → relax the probe timeout/threshold, and keep heavy dependency checks out of the probe.

gcloud run services update api --region asia-northeast1 \
  --min-instances 1 --max-instances 20 --cpu-boost
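To see how concurrency and max instances interact, a back-of-the-envelope estimate based on Little's law (in-flight requests ≈ arrival rate × latency) can be useful. This is a rough sizing sketch with hypothetical function names; it assumes steady load and ignores the CPU-utilization signal the autoscaler also uses, so treat the result as a starting point, not a prediction.

```python
import math


def in_flight_requests(requests_per_second: float, avg_latency_seconds: float) -> float:
    """Little's law: average number of requests being processed at once."""
    return requests_per_second * avg_latency_seconds


def instances_needed(requests_per_second: float, avg_latency_seconds: float,
                     concurrency: int) -> int:
    """Estimated instance count when each instance serves `concurrency` requests at once."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return math.ceil(in_flight_requests(requests_per_second, avg_latency_seconds) / concurrency)
```

At 200 requests per second with 0.5-second latency there are about 100 requests in flight: a concurrency of 10 needs roughly 10 instances, while a concurrency of 80 needs roughly 2. If the estimate at peak exceeds --max-instances, 503s during spikes are the expected outcome.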

Error 4: 504 Gateway Timeout

The request has been terminated because it has reached the maximum request timeout.

The response didn't finish within the request timeout (default 300s, max 60 min).

Causes and fixes:

The processing is simply long → extend the timeout (--timeout). But 60 minutes is the cap.

gcloud run services update api --region asia-northeast1 --timeout 600

Long-running processing that shouldn't be held in an HTTP request in the first place → move it to Cloud Run Jobs / Workflows (see the Jobs/Workflows guide). Endlessly extending the timeout only defers the design problem.
Reuse of a dead connection (a DB/HTTP connection taken from a broken pool) → add connection-health checks and retries.
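The health-check-and-retry fix for dead connections can be sketched generically. Everything named here (the Connection protocol with ping() and close(), CheckedConnection, call_with_retry) is hypothetical; real database drivers and HTTP clients expose their own equivalents, such as a pre-ping option on the pool, which are usually preferable when available.

```python
import time
from typing import Callable, Optional, Protocol, TypeVar

T = TypeVar("T")


class Connection(Protocol):
    """Hypothetical minimal interface; real drivers have their own equivalents."""

    def ping(self) -> bool: ...
    def close(self) -> None: ...


class CheckedConnection:
    """Holds one connection and verifies it with ping() before handing it out."""

    def __init__(self, connect: Callable[[], Connection]) -> None:
        self._connect = connect
        self._conn: Optional[Connection] = None

    def discard(self) -> None:
        """Drop the current connection, ignoring errors raised while closing it."""
        if self._conn is not None:
            try:
                self._conn.close()
            except Exception:
                pass
            self._conn = None

    def get(self) -> Connection:
        """Return a connection that passed ping(), reconnecting if it is dead."""
        if self._conn is not None:
            try:
                alive = self._conn.ping()
            except Exception:
                alive = False
            if not alive:
                self.discard()
        if self._conn is None:
            self._conn = self._connect()
        return self._conn


def call_with_retry(holder: CheckedConnection, operation: Callable[[Connection], T],
                    attempts: int = 3, base_delay: float = 0.1) -> T:
    """Run operation(conn); on ConnectionError, reconnect and retry with backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation(holder.get())
        except ConnectionError:
            holder.discard()  # never reuse a connection that just failed
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Retrying is only safe for idempotent operations; for writes, pair this with an idempotency key or let the error surface to the caller.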

Error 5: image-pull / permission error (deploy failure)

... the Google Cloud Run Service Agent must have permission to read the image ...

The image can't be read at deploy time, which means a permission is missing. This is not a code problem.

Causes and fixes:

Give the Cloud Run service agent (service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com) Artifact Registry read permission for the registry holding the image.

gcloud artifacts repositories add-iam-policy-binding app \
  --location asia-northeast1 \
  --member "serviceAccount:service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com" \
  --role "roles/artifactregistry.reader"

The image is in another project → grant permission on that project's side.
Check whether VPC Service Controls is blocking the read.

Relatedly, when the deploy itself fails because the deploy SA (CI) lacks permissions, see the CI/CD guide (run.developer + AR read + serviceAccountUser on the runtime SA).

Error 6: container import failure

The service has encountered an error during container import. Resource readiness deadline exceeded.

This is a failure at the image-import stage. It is relatively rare but hard to pin down.

Causes and fixes:

A non-UTF-8 filename in the image → rename the file to UTF-8 and rebuild.
Unsupported layers of a Windows image (foreign layers) → enable --allow-nondistributable-artifacts in the Docker daemon, then rebuild and push again.
The baseline is to ship a plain 64-bit Linux image (multi-stage, distroless, etc.).
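For the non-UTF-8 filename cause, one way to find offenders before building is to walk the build context with byte paths and try to decode each name. This is a sketch with hypothetical function names; it checks the directory you point it at, not the layers of an already-built image.

```python
import os
from typing import List


def is_utf8_name(name: bytes) -> bool:
    """True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_names(root: str) -> List[bytes]:
    """Walk root using bytes paths and return every path whose name is not UTF-8."""
    bad: List[bytes] = []
    # Passing bytes to os.walk makes it yield raw bytes names, undecoded.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_utf8_name(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Running find_non_utf8_names(".") from the directory used as the Docker build context lists the paths to rename; an empty list means filenames are not the cause.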

Error 7: slow cold start (not an error, but it hurts perceived responsiveness)

There is no explicit error, yet only the first request is oddly slow. That is a cold start after scale-to-zero.

Diagnosis and fixes (lightest first):

Make the image small (multi-stage, distroless/alpine). Pulling and unpacking get faster.
Enable startup CPU boost (--cpu-boost) to speed up initialization.
Lazy initialization: don't load everything at startup; initialize on demand. Create connections once, outside the handler, and reuse them.
Use the gen1 execution environment (faster cold start) for lightweight APIs.
Only for paths that still feel unacceptably slow, keep an instance warm with --min-instances 1 (a trade-off against always-on billing).
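The lazy-initialization item above can be sketched with a cached factory: the expensive object is built on the first request that needs it, then reused for the lifetime of the instance. ExpensiveClient, get_client, and handle_request are hypothetical stand-ins for a DB pool, model, or API client and its handler.

```python
import functools
import time


class ExpensiveClient:
    """Hypothetical stand-in for a DB pool, ML model, or API client."""

    instances_created = 0

    def __init__(self) -> None:
        time.sleep(0.05)  # simulate slow setup (network handshake, model load)
        ExpensiveClient.instances_created += 1

    def query(self, key: str) -> str:
        return f"value-for-{key}"


@functools.lru_cache(maxsize=None)
def get_client() -> ExpensiveClient:
    """Build the client on first use; later calls return the cached instance."""
    # Note: concurrent first calls from several threads may each construct one;
    # guard with a lock if construction must happen exactly once.
    return ExpensiveClient()


def handle_request(key: str) -> str:
    """Request handler: nothing heavy runs at import or startup time."""
    return get_client().query(key)
```

The trade-off is that the first request that touches the client pays the setup cost instead of the startup phase, so this suits dependencies that only some paths use; dependencies every request needs may be better initialized eagerly with CPU boost enabled.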

Detailed cost estimates and the options to choose from are summarized in the concurrency, autoscale, and cost-optimization guide.

Diagnosis quick reference

Symptom / message	Primary cause	What to try first
`failed to start and then listen on the port`	Can't listen on 0.0.0.0/$PORT, arm64, startup error	Reproduce locally with `-e PORT=8080`, `--platform linux/amd64`
exit 137 / `too much memory`	OOM (insufficient, leak, in-memory FS)	Increase `--memory`, stream, logs to stdout
503 `no available instance`	Cold start, scale limit, probe	`--min-instances`, `--max-instances`, `--cpu-boost`, relax probe
504 `maximum request timeout`	Processing exceeded, dead connection	`--timeout`, long-running to Jobs, validate connections
`Service Agent must have permission to read the image`	Insufficient AR read permission	Give the service agent `artifactregistry.reader`
`Resource readiness deadline exceeded`	Non-UTF-8 / foreign layers	Rebuild in UTF-8 / 64-bit Linux

Production-launch checklist (prevention)

docker run -e PORT=8080 starts locally
The image is 64-bit Linux (--platform linux/amd64 or multi-arch)
No heavy synchronous processing / blocking wait on external connections at startup
Stream large data without loading into memory (don't exhaust the in-memory FS)
Set min/max instances and probes to prevent 503
Long-running processing to Jobs/Workflows (don't paper over 504 by extending the timeout)
Speed up cause-tracing with structured logs (stdout)
Confirm the service agent's / deploy SA's permissions in advance
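For the structured-logs item in the checklist, a minimal sketch is to print one JSON object per line to stdout. Cloud Logging treats the severity and message keys of such lines specially; the helper names and the extra fields (route, attempt) are assumptions for illustration.

```python
import json
import sys


def format_log(severity: str, message: str, **fields) -> str:
    """Return one JSON log line; extra keyword fields become top-level keys."""
    entry = {"severity": severity, "message": message, **fields}
    # default=str keeps logging from raising on non-JSON types such as datetime.
    return json.dumps(entry, ensure_ascii=False, default=str)


def log(severity: str, message: str, **fields) -> None:
    """Write the entry to stdout as a single line, flushed immediately."""
    print(format_log(severity, message, **fields), file=sys.stdout, flush=True)
```

A call such as log("ERROR", "upstream call failed", route="/scan", attempt=2) then shows up as a filterable entry rather than an opaque text line.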

Summary: read the message and the cause is narrowed

Cloud Run troubleshooting is 90% reading the official error message accurately. "Won't start" → $PORT, 0.0.0.0, architecture; "137" → memory; "503" → scale and cold start; "504" → splitting off long-running processing; "image permission" → a missing IAM grant. The messages are designed so that you can descend from message to cause almost uniquely. Before rushing to tweak settings, read the log and reproduce locally. That is the shortest path.

Prevention starts with design: see the Cloud Run production-operations guide for the whole picture, the concurrency & billing guide for cost and cold starts, and the Jobs/Workflows guide for long-running processing. If you need a production incident investigated and a permanent fix designed, I can help with know-how from real operations.

Related articles

Google Cloud Run Production-Operations Guide: Container Contract, Concurrency, Auto-Scale, Deploy, Cost, and Security in Real Code

Cloud Run concurrency, autoscaling, billing model, and cost optimization: conquering scale-to-zero and cold starts in real code

Cloud Run CI/CD: keyless, Blue/Green, and canary in real code with Cloud Build / GitHub Actions × Workload Identity

Cloud Run Jobs and Cloud Workflows: designing long-running batch and parallel processing to be idempotent and resumable
