# Azure Container Apps troubleshooting: diagnosing and fixing revision Failed/Degraded, exit code 137, probes, and image-pull failures

> A systematic guide to diagnosing and fixing Azure Container Apps that won't start or that crash. It covers revision Failed/Degraded, exit code 137 (OOMKilled), immediate container exit, health-probe failures, image-pull failures, 403/unreachable endpoints, DNS (168.63.129.16), and unreachable scalers, organized by system-log message and following the procedures in the official Microsoft Learn docs.

- Published: 2026-06-26
- Author: 友田 陽大
- Tags: Azure, Container Apps, Troubleshooting, Observability, Reliability, Debugging, Infrastructure, Containers
- URL: https://tomodahinata.com/en/blog/azure-container-apps-troubleshooting-revision-failed-exit-code-137-probes-guide
- Category: Azure Container Apps in production
- Pillar guide: https://tomodahinata.com/en/blog/azure-container-apps-production-guide

## Key points

- First check the Running status under *Revisions and replicas*. `Failed` or `Degraded` signals a failed deployment. Then read system logs (platform events) and app logs (stdout/stderr) separately.
- Exit code 137 means OOMKilled (the kernel killed the process for exceeding its memory limit). Raise the memory limit or investigate the leak. An immediate exit with code 0 means the entrypoint/CMD isn't launching the service.
- A revision stuck at Provisioning/Processing, or Degraded after 10+ minutes (`Deployment Progress Deadline Exceeded. 0/1 replicas ready.`), is usually a probe misconfiguration. Re-check that the probe port matches the target port and that the startup grace period is long enough.
- The main causes of image-pull failure are a wrong tag, missing credentials, an unassigned AcrPull role, and network blocking. Note that pull validation runs only when the image reference changes.
- DNS needs care with VNet integration: a custom DNS server must forward unresolved queries to 168.63.129.16, and the network must not block login.microsoft.com or the registry FQDN. *Diagnose and solve problems* and *Log stream* are the starting points for first-line diagnosis.

---

"I deployed but the app won't start" and "it was running but suddenly crashes" are situations you will inevitably face in production operations. For Azure Container Apps (ACA), **the path from symptom to cause is officially documented.** Reading one log message is faster than redeploying repeatedly on a hunch.

This article summarizes **diagnosis and fixes by symptom**, faithful to Microsoft Learn's three official troubleshooting guides ([start failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-container-start-failures), [image-pull failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-image-pull-failures), [overall](https://learn.microsoft.com/en-us/azure/container-apps/troubleshooting)). For prerequisite knowledge of production operations, see the [Azure Container Apps production-operations guide](/blog/azure-container-apps-production-guide).

---

## Where to start: three diagnostic tools

Whatever happens, look at these three first.

1. **The Running status of revisions and replicas** (portal → *Application* → *Revisions and replicas*). `Failed` or `Degraded` indicates a failed deployment.
2. **Log stream** (portal → *Monitoring* → **Log stream**, not *Logs*). Switch between **Console** (the app's `stdout`/`stderr`) and **System** (platform events) to read.
3. **Diagnose and solve problems** (interactive diagnosis with no setup; categories like *Availability and Performance*).

> `If the Log stream page says This revision is scaled to zero., select the Go to Revision Management button. Deploy a new revision scaled to a minimum replica count of 1.` In other words, **no logs appear while scaled to zero.** Redeploy with a minimum of 1 replica, then observe.

### System logs vs app logs

> There are two kinds of logs, system and application logs. System logs show Azure Container Apps' platform activities, where application logs show what is logged to `stdout` and `stderror`. (— [troubleshoot-container-start-failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-container-start-failures))

**Telling the two apart is the crux.** Did the platform fail to start the container (system), or did the app crash with an exception (application)? Check the system-log message first.

| System-log message | Meaning |
|----------------------|-----|
| `ErrImagePull` | Image pull failed (absent, wrong reference, auth) |
| `Time-out` | Startup too long / a problem inside the container |
| `ContainerCrashing` | The container crashes repeatedly |
| `ingress routes not ready` | Ingress routes weren't ready and the replica failed |
| `Deployment Progress Deadline Exceeded. 0/1 replicas ready.` | Progress deadline exceeded. A startup or health-check problem |
| `Requests return status 403` | Access denied. Suspect the network configuration |
| `Error fetching scaler metrics` | Can't reach the scaling signal source |

---

## Symptom ①: exit code 137 (OOMKilled)

The most common cause of "suddenly crashes." **Exit code 137 means the process was killed by the Linux kernel's OOM killer for exceeding its memory limit.**
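
As a worked reading of the number itself: exit codes above 128 conventionally mean "terminated by signal (code − 128)", so 137 is 128 + 9, that is, SIGKILL. The sketch below decodes an exit code along those lines. The function name and the wording of the messages are illustrative assumptions, and SIGKILL can also come from sources other than the OOM killer, so the system log remains the place to confirm.

```python
# Linux signal numbers for the signals most often seen in container exits.
# Hard-coded so the sketch does not depend on the host platform.
LINUX_SIGNALS = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Return a short reading of a container exit code (illustrative helper)."""
    if code == 0:
        return "exited cleanly (check that the entrypoint starts a long-running process)"
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        if name == "SIGKILL":
            return "killed by SIGKILL (in a container, usually the OOM killer)"
        return f"terminated by {name}"
    return "application error (read the app log just before the exit)"


print(explain_exit_code(137))
```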

> The Container Apps runtime can kill container that runs out of memory or CPU resources. Check system logs for out of memory (OOM) errors or throttling. (— [troubleshoot-container-start-failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-container-start-failures))

**Fixes**:

1. **Raise the memory limit.** Because ACA allocates [CPU/memory as a fixed pair](/blog/azure-container-apps-production-guide) (memory in GiB = vCPU × 2), raising memory also means raising vCPU.
2. **Investigate the memory leak.** If usage climbs gradually and the container dies at 137, raising the limit only postpones the crash. Check the slope (a steady climb) of memory usage in Azure Monitor metrics.
3. **Move the in-memory cache outside the container.** Per the [principle of "hold no state in the container"](/blog/azure-container-apps-production-guide), offload a large cache to Redis.
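
To make the "fixed pair" in fix 1 concrete, here is a minimal sketch that finds the smallest CPU/memory pair covering a required amount of memory. It relies on the 2 GiB-per-vCPU ratio stated above; the 0.25 vCPU step size and the function name are assumptions for illustration, so check the allowed combinations for your plan before using the result.

```python
import math


def smallest_pair(required_memory_gib: float,
                  step_vcpu: float = 0.25,      # assumed allocation step
                  gib_per_vcpu: float = 2.0):   # ratio stated in the article
    """Return (vcpu, memory_gib) for the smallest pair with enough memory."""
    if required_memory_gib <= 0:
        raise ValueError("required_memory_gib must be positive")
    # Round the needed vCPU up to the next whole step.
    steps = math.ceil(required_memory_gib / gib_per_vcpu / step_vcpu)
    vcpu = steps * step_vcpu
    return vcpu, vcpu * gib_per_vcpu


# An app that needs about 1.2 GiB has to move up to the 0.75 vCPU / 1.5 GiB pair.
print(smallest_pair(1.2))
```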

> Official guidance: `If an app tends to experience out of memory errors or get killed, increase its memory limit or investigate memory leaks in the application.`
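
The "slope" check in fix 2 can be made quantitative. The following is a rough sketch, assuming you have exported memory samples as `(minute, MiB)` pairs from your metrics; the function names and the sample data are hypothetical. It fits a least-squares line and estimates how long until the limit is reached if the trend continues.

```python
def memory_slope_mib_per_min(samples):
    """Least-squares slope of (minute, MiB) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


def minutes_until_limit(samples, limit_mib):
    """Minutes until the limit at the current trend, or None if not climbing."""
    slope = memory_slope_mib_per_min(samples)
    if slope <= 0:
        return None
    latest_mib = samples[-1][1]
    return max(0.0, (limit_mib - latest_mib) / slope)


# Hypothetical samples: usage climbs 50 MiB every 10 minutes.
samples = [(0, 400), (10, 450), (20, 500), (30, 550)]
print(memory_slope_mib_per_min(samples), minutes_until_limit(samples, 1024))
```

A steadily positive slope under flat traffic suggests a leak; a slope that follows load is more likely an undersized limit.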

---

## Symptom ②: the container exits immediately (exit code 0 / nonzero)

It ends right after starting. There are two families of cause.

### Nonzero exit = an app crash

> If the container exited with a nonzero exit code, then an unhandled exception or error in your application code is often the problem. (— [troubleshoot-container-start-failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-container-start-failures))

Typical causes are an unhandled exception, a missing dependency, or a misconfiguration. Read the exception logged just before the crash in the **app log.** Reproducing it **locally** with `docker run --rm <IMAGE>` is the shortest path.

### An exit-code-0 immediate exit = entrypoint/CMD

> If you overwrote the startup command or if your Dockerfile's entrypoint isn't launching the app correctly, the container could start and then immediately exit.

In this case the startup command **ends without launching the service.** For Node.js, confirm that `CMD` starts the server; for Python, confirm that the command starts a long-running process. You can fix the command and args in the portal under *Edit container*.
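
For the Python case, "a long-running process" means the entrypoint ends in a blocking call rather than returning. Below is a minimal sketch using only the standard library; the `PORT` environment variable, the default of 8080, and the handler are assumptions for illustration, not something ACA requires.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200 and a short body."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def build_server(host: str = "0.0.0.0", port: Optional[int] = None) -> HTTPServer:
    """Bind the listening socket; the port should equal the Ingress target port."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))  # hypothetical variable name
    return HTTPServer((host, port), HealthHandler)


def main() -> None:
    server = build_server()
    try:
        # serve_forever() blocks, which is what keeps the container alive.
        # A script that only sets things up and returns would exit with code 0.
        server.serve_forever()
    finally:
        server.server_close()
```

The image's startup command would need to call `main()`; the block leaves it uncalled so that importing the module has no side effects.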

---

## Symptom ③: stuck at Provisioning/Processing, or Degraded

"Never finishes starting" or "Degraded after 10+ minutes" is **usually a probe misconfiguration.**

> Revision is degraded: A new revision takes more than 10 minutes to provision. It finally has a Provision status of Provisioned, but a Running status of Degraded. The Running status tooltip reads `Details: Deployment Progress Deadline Exceeded. 0/1 replicas ready.` (— [troubleshooting](https://learn.microsoft.com/en-us/azure/container-apps/troubleshooting))

Causes and fixes:

1. **Target-port mismatch.** Does the health-probe (TCP) port match both the app's listening port and the Ingress target port? This is the most frequent cause.
2. **Slow startup** (Java/JVM, etc.). Extend `initialDelaySeconds`. The default startup probe is `failureThreshold: 240` (240-second grace), but if that's still not enough, [customize the probe](/blog/azure-container-apps-production-guide).
3. **An HTTP probe on a worker that doesn't speak HTTP.** Disable Ingress (or make it internal) and remove the unnecessary HTTP probe.
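
A TCP probe passes when a connection to the port can be opened. To check cause 1 while reproducing locally, a small sketch like the following tests whether anything is listening on the port you set as the target port. The function name is illustrative, and this only approximates what the platform's probe does.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example: with the container running locally, check the intended target port.
# print(tcp_port_open("127.0.0.1", 8080))
```

If this returns `False` for the target port but `True` for another port, the app is listening somewhere other than where the probe and Ingress expect it.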

Reference: the default probes when Ingress is enabled ([official](https://learn.microsoft.com/en-us/azure/container-apps/troubleshooting)).

| Property | Startup | Readiness |
|----------|---------|-----------|
| Protocol | TCP | TCP |
| Port | Ingress target port | Ingress target port |
| Initial delay | 1 sec | 3 sec |
| Failure threshold | 240 | 48 |
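
The grace period these defaults give can be worked out as roughly initial delay + period × failure threshold. The table does not list the probe period; the sketch below assumes 1 second for startup (consistent with the "240-second grace" mentioned above) and 5 seconds for readiness, so treat those two values, and the class itself, as assumptions to verify against your configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    initial_delay_s: int
    period_s: int          # assumed values; not listed in the table above
    failure_threshold: int

    def grace_seconds(self) -> int:
        """Approximate time before the probe is declared failed."""
        return self.initial_delay_s + self.period_s * self.failure_threshold

    def covers(self, app_startup_s: int) -> bool:
        """True if an app that needs app_startup_s to listen fits the grace."""
        return app_startup_s <= self.grace_seconds()


DEFAULT_STARTUP = Probe(initial_delay_s=1, period_s=1, failure_threshold=240)
DEFAULT_READINESS = Probe(initial_delay_s=3, period_s=5, failure_threshold=48)

print(DEFAULT_STARTUP.grace_seconds(), DEFAULT_READINESS.grace_seconds())
```

An app that needs longer than the startup grace to open its port is a candidate for a customized probe.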

---

## Symptom ④: image-pull failure (ErrImagePull)

> An image pull failure occurs when the Azure platform is unable to download (or pull) the container image that the application requires. (— [troubleshoot-image-pull-failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-image-pull-failures))

Main causes and fixes:

| Cause | Fix |
|------|-----|
| **Wrong image name/tag** | Confirm the exact name/tag. Verify locally with `docker pull <IMAGE>` |
| **Missing credentials (private registry)** | Are the credentials correct and unexpired? **For ACR, is the `AcrPull` role assigned to the managed identity?** |
| **Network blocking** | Can the environment reach the registry FQDN? Check DNS, NSG, firewall |
| **Rate limiting** (Docker Hub, etc.) | Wait a few minutes. Use a registry with ample quota like ACR |

> An important trap in this optimization: `Image pull validation is only performed when the container image reference changes during an update. When you update your container app with the same image reference, the validation process is skipped.` That is, **an update with the same image reference skips pull validation.** To re-validate, change the tag and update, which is one more reason not to keep overwriting a mutable tag such as `latest`.
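
A pre-deploy check can flag both situations. The sketch below splits an image reference into repository, tag, and digest, then reports whether an update would change the reference and whether it relies on `latest`. The function names are hypothetical, and comparing references as plain strings is an assumption about how "the image reference changes" is evaluated.

```python
def split_image_ref(ref: str):
    """Split 'repo[:tag][@digest]' into (repo, tag, digest)."""
    name, _, digest = ref.partition("@")
    # A colon before the last '/' is a registry port, not a tag separator.
    last_segment_start = name.rfind("/") + 1
    colon = name.find(":", last_segment_start)
    if colon == -1:
        repo = name
        tag = None if digest else "latest"  # Docker's default tag
    else:
        repo, tag = name[:colon], name[colon + 1:]
    return repo, tag, digest or None


def reference_changed(old_ref: str, new_ref: str) -> bool:
    """True if the update changes the image reference (assumed string compare)."""
    return old_ref != new_ref


def uses_latest_tag(ref: str) -> bool:
    _, tag, digest = split_image_ref(ref)
    return digest is None and tag == "latest"


print(split_image_ref("myacr.azurecr.io/api:1.4.2"))
```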

---

## Symptom ⑤: can't connect to the endpoint / 403

The app runs but can't be reached from outside. Check the Ingress settings ([network design](/blog/azure-container-apps-networking-vnet-private-endpoint-waf-egress-guide)).

| Check | Action |
|------|----------|
| Is Ingress enabled? | Confirm that *Enabled* is checked |
| Should it be exposed externally? | Set *Ingress Traffic* to *Accepting traffic from anywhere*. If the app doesn't speak HTTP, set it to *Limited to Container Apps Environment* |
| Protocol | Is *Ingress type* (HTTP/TCP) correct? |
| Target port | Does it match the app's listening port? |
| IP restriction | Is the client IP denied in *IP Security Restrictions*? |

A `403` response points to the network configuration: an IP restriction, or an app mistakenly exposed only internally.
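
To reason about the IP-restriction row, it can help to evaluate a client address against the rules offline. The sketch below assumes the semantics as I understand them: all rules are either Allow or Deny, unmatched traffic is denied when Allow rules exist, and unmatched traffic is allowed when only Deny rules exist. The rule format and function name are illustrative, so verify the behaviour against the Ingress documentation.

```python
import ipaddress


def is_client_allowed(client_ip: str, rules) -> bool:
    """rules: list of (action, cidr) tuples, action being 'Allow' or 'Deny'."""
    if not rules:
        return True  # no restrictions configured
    actions = {action for action, _ in rules}
    if not actions <= {"Allow", "Deny"}:
        raise ValueError("action must be 'Allow' or 'Deny'")
    if len(actions) != 1:
        raise ValueError("rules are assumed to be all Allow or all Deny")
    ip = ipaddress.ip_address(client_ip)
    matched = any(ip in ipaddress.ip_network(cidr, strict=False) for _, cidr in rules)
    # Allow rules: only matches pass. Deny rules: only matches are blocked.
    return matched if "Allow" in actions else not matched


# Documentation-range addresses used as sample data.
allow_office = [("Allow", "203.0.113.0/24")]
print(is_client_allowed("198.51.100.1", allow_office))
```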

---

## Symptom ⑥: doesn't scale (Error fetching scaler metrics)

"The queue is piling up but replicas don't increase." If `Error fetching scaler metrics` appears in the system log, it's a sign that **the scaling signal source (DB, Event Hub, another app) can't be reached.**

> Ensure that your application can connect to the source of your scaling signal. (— [troubleshoot-container-start-failures](https://learn.microsoft.com/en-us/azure/container-apps/troubleshoot-container-start-failures))

Check the VNet, DNS, firewall, and the scale rule's authentication (managed-identity role assignment). For scaling design itself, head to the [KEDA autoscale guide](/blog/azure-container-apps-keda-autoscaling-scale-to-zero-event-driven-guide).

---

## A cross-cutting cause: DNS and networking

A considerable share of the trouble in VNet-integrated environments comes down to **DNS.**

> Azure recursive resolvers uses the IP address `168.63.129.16` to resolve requests. (— [troubleshooting](https://learn.microsoft.com/en-us/azure/container-apps/troubleshooting))

- If you use custom DNS, **forward unresolved queries to `168.63.129.16`.**
- **Don't block `168.63.129.16`** in the NSG/firewall.
- When using a managed identity, ensure **reachability to `login.microsoft.com` (or `<REGION>.login.microsoft.com`).**
- Confirm that the private registry's FQDN can be resolved and reached.

In an environment that restricts egress with UDR + Firewall, leaving these out of the allowlist causes "image-pull failure," "scaler unreachable," and "token-acquisition failure" to occur **one after another.**
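
The checklist above can be sketched as a simple comparison against an egress allowlist. This is only a checklist aid under stated assumptions: the allowlist is modelled as a flat set of exact strings (real firewalls use rules, tags, and wildcards), and the function name and sample registry are hypothetical.

```python
def missing_egress_targets(allowlist, registry_fqdn, region=None):
    """Return the required targets from the checklist that are not allowlisted."""
    allowed = {entry.strip().lower() for entry in allowlist}
    missing = []

    if "168.63.129.16" not in allowed:
        missing.append("168.63.129.16")

    # Either the global or the regional login endpoint satisfies the checklist.
    identity_hosts = ["login.microsoft.com"]
    if region:
        identity_hosts.append(f"{region.lower()}.login.microsoft.com")
    if not any(host in allowed for host in identity_hosts):
        missing.append(" or ".join(identity_hosts))

    if registry_fqdn.lower() not in allowed:
        missing.append(registry_fqdn)
    return missing


print(missing_egress_targets(["168.63.129.16", "myacr.azurecr.io"], "myacr.azurecr.io"))
```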

---

## Can't delete the environment (ScheduledForDelete)

When a VNet-integrated environment stays at `provisioningState: ScheduledForDelete` and never disappears, the cause is **the associated VNet still existing.**

```azurecli
# 1) Identify the subnet/VNet the environment uses (check infrastructureSubnetId in the output)
az containerapp env show --resource-group <RG> --name <ENV>
# 2) Delete the VNet manually, then 3) delete the environment
az network vnet delete --resource-group <RG> --name <VNET>
az containerapp env delete --resource-group <RG> --name <ENV> --yes
```

---

## If the CLI says "missing parameters"

If you get a puzzling "missing parameters" error from `az containerapp`, the cause is often simply that **the extension is out of date.**

```azurecli
az extension add --name containerapp --upgrade
# If you use preview features
az extension add --name containerapp --upgrade --allow-preview true
```

---

## Diagnosis flow (summary)

```text
Won't start / crashes
  ├─ Check the Running status under Revisions and replicas
  │    Failed/Degraded → deployment failed
  ├─ Triage by the System log message
  │    ErrImagePull        → Symptom ④ (image pull)
  │    ContainerCrashing   → Symptom ② (exit code / app log)
  │    Deadline Exceeded   → Symptom ③ (probes / ports)
  │    403                 → Symptom ⑤ (Ingress / network)
  │    scaler metrics      → Symptom ⑥ (signal-source reachability / DNS)
  ├─ Read the exception in the Application log (stdout/stderr)
  ├─ Reproduce locally with docker run --rm <IMAGE>
  └─ Narrow down interactively with Diagnose and solve problems
```
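
The system-log branch of the flow above can be sketched as a lookup. The message fragments come from the table earlier in this article; the function name, the case-insensitive substring matching, and the returned labels are assumptions for illustration.

```python
# (fragment to look for, symptom section to read next)
TRIAGE = [
    ("ErrImagePull", "Symptom ④: image pull"),
    ("ContainerCrashing", "Symptom ②: exit code / app log"),
    ("Deployment Progress Deadline Exceeded", "Symptom ③: probes / ports"),
    ("status 403", "Symptom ⑤: Ingress / network"),
    ("Error fetching scaler metrics", "Symptom ⑥: signal-source reachability / DNS"),
]


def triage(system_log_line: str):
    """Return the matching symptom label, or None if nothing matches."""
    lowered = system_log_line.lower()
    for fragment, symptom in TRIAGE:
        if fragment.lower() in lowered:
            return symptom
    return None


print(triage("Details: Deployment Progress Deadline Exceeded. 0/1 replicas ready."))
```

A `None` result means the line is not one of the listed messages, in which case the app log and *Diagnose and solve problems* are the next places to look.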

---

## Summary

For ACA trouble, the path **symptom → system-log message → cause** is officially documented. The points to remember: **137 is OOM**, **an exit-0 immediate exit is the entrypoint**, **stuck at Provisioning / Degraded is probes & ports**, **ErrImagePull is the tag, AcrPull, network**, **can't connect/403 is Ingress**, and **doesn't scale + VNet is DNS.** Before redeploying repeatedly on a hunch, read one line of the log. That is the fastest route to recovery.

For incident response in production operations, observability design, and recurrence prevention, [contact me](/contact). For the full design picture, see the [Azure Container Apps production-operations guide](/blog/azure-container-apps-production-guide).
