# Cloud Run troubleshooting compendium: causes and fixes for start failures, 503/504, OOM (exit 137), cold starts, and deploy failures

> A practical guide to fixing common production Cloud Run errors by cause, quoting the exact official messages. It covers 'Container failed to start and listen on the port defined by the PORT environment variable,' exit 137 (OOM) from exceeding the memory limit, 503 'no available instance,' 504 request timeout, image-pull permission errors, and slow cold starts, with diagnosis steps and gcloud/code fixes for each.

- Published: 2026-06-28
- Author: 友田 陽大
- Tags: GCP, Cloud Run, Troubleshooting, Observability, Serverless, Performance, Reliability, Infrastructure
- URL: https://tomodahinata.com/en/blog/google-cloud-run-troubleshooting-container-failed-to-start-cold-start-timeout-oom-guide
- Category: Google Cloud Run in production
- Pillar guide: https://tomodahinata.com/en/blog/google-cloud-run-production-guide

## Key points

- The most frequent error is 'Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable.' The cause is one of: not listening on 0.0.0.0 at $PORT, an image not built for 64-bit Linux (linux/amd64), or an error during startup.
- Exit 137 comes with 'the container instance was found to be using too much memory and was terminated', meaning the instance exceeded its memory limit (OOM). The cause is a memory leak, insufficient allocation, or writes to the in-memory file system. Add memory, fix the leak, or send logs to stdout instead of files.
- 503 'The request was aborted because there was no available instance' points to an instance shortage, a slow cold start, or probe failures. Fix it by raising max instances, keeping instances warm with min instances, and relaxing the probe timeout.
- 504 'reached the maximum request timeout' means processing exceeded the timeout or a dead connection was reused. Extend the timeout, validate connections and retry, or move long-running processing to Jobs.
- Most deploy failures are permission problems. 'Cloud Run Service Agent must have permission to read the image' means the service agent lacks Artifact Registry read permission. Diagnose with Cloud Logging and local reproduction (docker run -e PORT=8080).

---

Once you read a Cloud Run error message correctly, it narrows the cause almost uniquely. Conversely, **tweaking settings on a hunch gets you stuck.** This article organizes common production errors by **the exact message in the [official documentation](https://docs.cloud.google.com/run/docs/troubleshooting)** and shows the cause and fix by the shortest path. The goal is that someone who searched for `cloud run container failed to start` can **solve the problem on this page and move on.**

To avoid these errors at the design stage, see the [Cloud Run production-operations guide](/blog/google-cloud-run-production-guide); for cold-start and cost optimization, see the [concurrency & billing guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide).

---

## First, diagnosis: logs plus local reproduction

Before getting into individual errors, settle on the **two things you do first for any problem.**

```bash
# 1. Read the revision's logs (the primary source for errors)
gcloud run services logs read api --region asia-northeast1 --limit 100

# 2. Reproduce production conditions locally (does it start when PORT is injected?)
docker run --rm -e PORT=8080 -p 8080:8080 \
  asia-northeast1-docker.pkg.dev/PROJECT_ID/app/api:TAG
# → If http://localhost:8080 is unreachable, it won't start in production either
```

**"Does it start locally with `-e PORT=8080`?"** That check alone reproduces most start-related errors on your own machine, where they are far easier to fix. Cloud Logging shows the app's stdout/stderr as-is, so emitting structured logs makes tracing the cause faster ([observability design](/blog/google-cloud-run-production-guide)).
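As a minimal sketch of what "structured logs" can look like, the snippet below writes one JSON object per line to stdout. It relies on Cloud Logging treating the `severity` and `message` keys of a JSON line specially; the helper names `format_log` and `log` and the extra fields (`component`, `attempt`) are made up for illustration.

```python
import json
import sys


def format_log(severity: str, message: str, **fields) -> str:
    """Build one JSON log line; `severity` and `message` are the keys Cloud Logging picks up."""
    entry = {"severity": severity, "message": message, **fields}
    return json.dumps(entry, ensure_ascii=False)


def log(severity: str, message: str, **fields) -> None:
    """Write the entry to stdout as a single line and flush immediately."""
    print(format_log(severity, message, **fields), file=sys.stdout, flush=True)


if __name__ == "__main__":
    # Hypothetical startup failure, logged with fields you can filter on later.
    log("ERROR", "failed to connect to database", component="startup", attempt=3)
```

Flushing on every line is a deliberate choice here: if the instance is killed right after a failure, buffered output could otherwise be lost.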

---

## Error 1: Container failed to start (most frequent)

> **`Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable.`**

The most frequent error, appearing right after deploy. Cloud Run started the container but concluded that it is not listening on `$PORT`.

**Causes and fixes:**

1. **Not listening on `0.0.0.0` at `$PORT`** (most common). The server is bound to `localhost`/`127.0.0.1`, or the port is hardcoded.

```python
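import os

import uvicorn

# ✗ Bad: bound to 127.0.0.1 with a hardcoded port → Cloud Run can't reach it and startup fails
# uvicorn.run(app, host="127.0.0.1", port=3000)

# ✓ Good: bind to 0.0.0.0 and read the PORT environment variable (`app` is your ASGI app)
uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))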
```

2. **Not built for 64-bit Linux (linux/amd64).** This tends to happen when you deploy an image built locally on Apple Silicon (arm64) as-is. Build a multi-arch image, or specify `--platform linux/amd64`.

```bash
docker build --platform linux/amd64 -t IMAGE_URL .
```

3. **The startup processing itself fails** (for example, timing out while waiting on a connection to a dependency at startup, or an exception from missing config). **Avoid heavy synchronous initialization and blocking external connections at startup.** Check the log for an exception. If startup is genuinely slow, widen the [startup probe](/blog/google-cloud-run-production-guide)'s `failure_threshold × period_seconds`.
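To make the `failure_threshold × period_seconds` product concrete, here is a small arithmetic sketch. It treats the product as the approximate time budget the container has to start listening; the function names and the numbers are hypothetical, so substitute the startup time you actually observe in the logs.

```python
import math


def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time the container has to start listening before the probe gives up."""
    return failure_threshold * period_seconds


def min_failure_threshold(observed_startup_seconds: float, period_seconds: int) -> int:
    """Smallest failure_threshold whose budget covers an observed startup time."""
    return math.ceil(observed_startup_seconds / period_seconds)


if __name__ == "__main__":
    # Hypothetical numbers: the app was measured to take about 95 s to start.
    print(startup_budget_seconds(failure_threshold=3, period_seconds=10))  # 30
    print(min_failure_threshold(observed_startup_seconds=95, period_seconds=10))  # 10
```

With these assumed values, a 30-second budget is far short of a 95-second startup, so either the threshold goes up to at least 10 or, better, the startup work gets lighter.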

---

## Error 2: memory overrun / exit 137 (OOM)

> **`... the container instance was found to be using too much memory and was terminated.`** (the container's exit code is **137**)

The instance was OOM-killed for exceeding its allocated memory. Note that Cloud Run's **writable file system is in-memory**, so anything you write to disk consumes memory.

**Causes and fixes:**

1. **Insufficient memory allocation** → increase it to match the peak (increasing it blindly raises cost, so decide from metrics).

```bash
gcloud run services update api --region asia-northeast1 --memory 1Gi
```

2. **Writes to the in-memory FS** (large temp files, logging to local files) → keep temporary data small, or **stream it to Cloud Storage.** Send logs to **stdout, not files.**

3. **Memory leak** (connections or buffers accumulating) → fix the leak. Check whether you hold large objects across requests.

> My malware scanner handled inputs of up to 10 GiB, so it **scanned by streaming without buffering** to avoid memory exhaustion. A design that loads everything into memory before processing is especially dangerous on serverless. **Process large data by streaming.**
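The streaming idea can be sketched as follows. This is not the scanner mentioned above; it uses a SHA-256 digest as a stand-in for "some processing over a large input", and the 1 MiB chunk size is an assumed value. The point is that memory use stays around one chunk regardless of input size.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an assumed value, tune it to your workload


def sha256_streaming(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a stream chunk by chunk so the whole payload is never held in memory."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # io.BytesIO stands in for a file, an upload body, or an object-storage download.
    print(sha256_streaming(io.BytesIO(b"hello cloud run")))
```

The same loop shape applies to any file-like object, so the processing step can be swapped out without changing the memory profile.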

---

## Error 3: 503 Service Unavailable

> **`The request was aborted because there was no available instance.`**

No instance was available to handle the request. Traffic spikes, cold starts, scale limits, and probe failures can all be involved.

**Causes and fixes:**

1. **Max instances too low for a traffic spike** → raise `--max-instances`.
2. **Cold start too slow, so new instances are not ready in time** → keep instances warm with `--min-instances 1` or more, and apply [cold-start countermeasures](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide) (startup CPU boost, slim image, gen1).
3. **Concurrency too low, inflating the number of instances needed** → revisit concurrency (raise it if the workload is I/O-bound).
4. **Liveness probe too strict, so instances restart repeatedly** → relax the probe timeout/threshold, and don't put a heavy dependency check in the probe.

```bash
gcloud run services update api --region asia-northeast1 \
  --min-instances 1 --max-instances 20 --cpu-boost
```
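To see how concurrency and max instances interact (causes 1 and 3 above), a rough back-of-the-envelope estimate helps. The sketch below applies Little's law (in-flight requests ≈ arrival rate × latency) and divides by per-instance concurrency. It is an approximation that ignores burstiness and autoscaler behavior, and the function name and traffic numbers are hypothetical.

```python
import math


def required_instances(requests_per_second: float, avg_latency_seconds: float, concurrency: int) -> int:
    """Rough instance count needed to hold the average number of in-flight requests."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    in_flight = requests_per_second * avg_latency_seconds  # Little's law
    return math.ceil(in_flight / concurrency)


if __name__ == "__main__":
    # Hypothetical spike: 200 req/s at 0.5 s average latency.
    print(required_instances(200, 0.5, concurrency=80))  # 2
    print(required_instances(200, 0.5, concurrency=1))   # 100
```

If the estimate for a realistic spike exceeds `--max-instances`, 503s are the expected outcome, and raising concurrency (for I/O-bound work) can be cheaper than raising the instance cap.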

---

## Error 4: 504 Gateway Timeout

> **`The request has been terminated because it has reached the maximum request timeout.`**

The response didn't finish within the [request timeout](/blog/google-cloud-run-production-guide) (default 300s, max 60 min).

**Causes and fixes:**

1. **The processing is simply long** → extend the timeout (`--timeout`). But **60 minutes is the cap.**

```bash
gcloud run services update api --region asia-northeast1 --timeout 600
```

2. **Long-running processing that shouldn't be held in an HTTP request in the first place** → **move it to Cloud Run Jobs / Workflows** ([Jobs/Workflows guide](/blog/google-cloud-run-jobs-workflows-batch-async-idempotent-guide)). Extending the timeout indefinitely only defers the design decision.
3. **Reuse of a dead connection** (a stale DB/HTTP connection taken from a pool) → add connection-health checks and retries.
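For cause 3, one possible shape of "validate and retry" is sketched below: on a connection error the stale connection is rebuilt and the call is retried with exponential backoff, and after the last attempt the error is raised instead of hanging until the 504. `call_with_retry`, `operation`, and `reset_connection` are hypothetical names, and the pattern is only safe for idempotent operations.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    reset_connection: Callable[[], None],
    attempts: int = 3,
    base_delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`; on ConnectionError, rebuild the connection and retry with backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: fail fast rather than wait for the request timeout
            reset_connection()  # drop the dead connection before trying again
            sleep(base_delay_seconds * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A caller would pass something like a query function as `operation` and a pool-reconnect function as `reset_connection`. Many database drivers and HTTP clients also offer built-in health checks on pooled connections, which are worth enabling before writing a wrapper like this.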

---

## Error 5: image-pull / permission error (deploy failure)

> **`... the Google Cloud Run Service Agent must have permission to read the image ...`**

The image can't be read at deploy time, which means **insufficient permission.** It is not a code problem.

**Causes and fixes:**

- Give the Cloud Run **service agent** (`service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com`) **Artifact Registry read permission** for the registry holding the image.

```bash
gcloud artifacts repositories add-iam-policy-binding app \
  --location asia-northeast1 \
  --member "serviceAccount:service-PROJECT_NUMBER@serverless-robot-prod.iam.gserviceaccount.com" \
  --role "roles/artifactregistry.reader"
```

- **The image is in another project** → grant the permission in that project.
- Check whether **VPC Service Controls** is blocking the read.

Relatedly, for cases where the deploy itself fails because the deploy SA (CI) lacks permissions, see the [CI/CD guide](/blog/google-cloud-run-cicd-cloud-build-github-actions-workload-identity-blue-green-canary-guide) (`run.developer` + AR read + `serviceAccountUser` on the runtime SA).

---

## Error 6: container import failure

> **`The service has encountered an error during container import. Resource readiness deadline exceeded.`**

The failure happens at the image-import stage. It is relatively rare but hard to pin down.

**Causes and fixes:**

- **A non-UTF-8 filename** in the image → rename the files to UTF-8 and rebuild.
- **Unsupported foreign layers from a Windows image** → enable `--allow-nondistributable-artifacts` in the Docker daemon and push the image again.
- As a baseline, keep it **a straightforward 64-bit Linux image** (multi-stage, distroless, etc.).
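Non-UTF-8 filenames are hard to spot by eye, so a small scan of the build context before `docker build` can help. The sketch below assumes the offending names come from files copied in from your build context (it does not inspect base-image layers), and `non_utf8_names` and `scan_tree` are hypothetical helper names.

```python
import os
from typing import Iterable, List


def non_utf8_names(names: Iterable[bytes]) -> List[bytes]:
    """Return the names whose raw bytes are not valid UTF-8."""
    bad = []
    for name in names:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


def scan_tree(root: str) -> List[bytes]:
    """Walk `root` with bytes paths so undecodable names are seen as raw bytes."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in non_utf8_names(dirnames + filenames):
            bad.append(os.path.join(dirpath, name))
    return bad


if __name__ == "__main__":
    for path in scan_tree("."):
        print(path)  # printed as bytes, since the name cannot be decoded
```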

---

## Error 7: slow cold start (not an error, but it hurts perceived performance)

There is no explicit error, yet only the first request is oddly slow. That is a **cold start** after scale-to-zero.

**Diagnosis and fixes (lightest first):**

1. **Make the image small** (multi-stage, distroless/alpine). Pull and unpack get faster.
2. **Startup CPU boost** (`--cpu-boost`) to speed up initialization.
3. **Lazy initialization**: don't load everything at startup; initialize on demand. **Create connections once outside the handler** and reuse them.
4. **Execution environment gen1** (faster cold start) for lightweight APIs.
5. Only for paths whose latency is still unacceptable, keep instances warm with **`--min-instances 1`** (a trade-off against always-on billing).
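Item 3 (lazy initialization, with connections created once and reused) can be sketched like this. `ExpensiveClient` is a hypothetical stand-in for a DB pool or API client that is slow to construct; `functools.lru_cache` is used here simply as a convenient "create on first call, reuse afterwards" mechanism.

```python
import functools


class ExpensiveClient:
    """Hypothetical stand-in for a DB pool or API client that is slow to construct."""

    instances_created = 0

    def __init__(self) -> None:
        ExpensiveClient.instances_created += 1

    def query(self, key: str) -> str:
        return f"value-for-{key}"


@functools.lru_cache(maxsize=None)
def get_client() -> ExpensiveClient:
    """Create the client on first use, then return the same object on every later call."""
    return ExpensiveClient()


def handle_request(key: str) -> str:
    """Request handler: nothing is constructed at import time, so startup stays fast."""
    return get_client().query(key)
```

The cost moves from startup to the first request that needs the client, which is usually a good trade for cold starts. Under concurrency, two simultaneous first requests may both construct a client; add a lock around creation if that matters for your dependency.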

Detailed cost estimates and options are summarized in the [concurrency, autoscale, and cost-optimization guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide).

---

## Diagnosis quick reference

| Symptom / message | Primary cause | What to try first |
|------------------|---------|------------|
| `failed to start and then listen on the port` | Can't listen on 0.0.0.0/$PORT, arm64, startup error | Reproduce locally with `-e PORT=8080`, `--platform linux/amd64` |
| exit 137 / `too much memory` | OOM (insufficient, leak, in-memory FS) | Increase `--memory`, stream, logs to stdout |
| 503 `no available instance` | Cold start, scale limit, probe | `--min-instances`, `--max-instances`, `--cpu-boost`, relax probe |
| 504 `maximum request timeout` | Processing exceeded, dead connection | `--timeout`, long-running to Jobs, validate connections |
| `Service Agent must have permission to read the image` | Insufficient AR read permission | Give the service agent `artifactregistry.reader` |
| `Resource readiness deadline exceeded` | Non-UTF-8 / foreign layers | Rebuild in UTF-8 / 64-bit Linux |

---

## Production-launch checklist (prevention)

- [ ] The container starts locally with **`docker run -e PORT=8080`**
- [ ] The image is **64-bit Linux** (`--platform linux/amd64` or multi-arch)
- [ ] No **heavy synchronous processing / blocking wait on external connections** at startup
- [ ] **Stream large data without loading into memory** (don't exhaust the in-memory FS)
- [ ] Set **min/max instances** and probes to prevent 503
- [ ] Move long-running processing to **Jobs/Workflows** (don't paper over 504 by extending the timeout)
- [ ] Speed up cause-tracing with **structured logs** (stdout)
- [ ] Confirm the service agent's / deploy SA's **permissions** in advance

---

## Summary: read the message and the cause is narrowed

Cloud Run troubleshooting is 90% **reading the official error message accurately.** "Won't start" → `$PORT`, `0.0.0.0`, architecture; "137" → memory; "503" → scale and cold start; "504" → moving long-running processing elsewhere; "image permission" → IAM. The messages are written so that you can **trace each one to its cause.** Before rushing to tweak settings, read the log and reproduce locally. That's the shortest path.

Prevention starts from design: the whole picture is in the [Cloud Run production-operations guide](/blog/google-cloud-run-production-guide), cost and cold start in the [concurrency & billing guide](/blog/google-cloud-run-autoscaling-concurrency-billing-cost-optimization-guide), and long-running processing in the [Jobs/Workflows guide](/blog/google-cloud-run-jobs-workflows-batch-async-idempotent-guide). If you need a production incident investigated and a permanent fix designed, I can help with hands-on operations experience.
