Azure Container Apps troubleshooting: diagnosing and fixing revision Failed/Degraded, exit code 137, probes, and image-pull failures

"I deployed, but the app won't start." "It was running, then it suddenly crashed." Both are routine in production operations. For Azure Container Apps (ACA), Microsoft documents the path from symptom to cause, and reading one log message is faster than redeploying repeatedly on a hunch.

This article summarizes diagnosis and fixes by symptom, staying faithful to Microsoft Learn's three official troubleshooting guides (container start failures, image-pull failures, and general troubleshooting). For prerequisite knowledge of production operations, see the Azure Container Apps production-operations guide.

The entrance to diagnosis: three tools

Whatever happens, look at these three first.

The Running status of revisions and replicas (portal → Application → Revisions and replicas). A status of Failed or Degraded indicates a deploy failure.
Log stream (portal → Monitoring → Log stream, not Logs). Switch between Console (the app's stdout/stderr) and System (platform events) to read.
Diagnose and solve problems (interactive diagnosis with no setup; categories like Availability and Performance).

If the Log stream page says "This revision is scaled to zero.", select the Go to Revision Management button and deploy a new revision with a minimum replica count of 1. No logs appear while the app is scaled to zero, so redeploy with at least one replica and then observe.

System logs vs app logs

There are two kinds of logs, system and application logs. System logs show Azure Container Apps' platform activities, where application logs show what is logged to stdout and stderror. (— troubleshoot-container-start-failures)

Separating the two is the crux: did the platform fail to start the container (system), or did the app crash with an exception (application)? Check the system-log message first.

System-log message	Meaning
`ErrImagePull`	Image pull failed (absent, wrong reference, auth)
`Time-out`	Startup too long / a problem inside the container
`ContainerCrashing`	The container crashes repeatedly
`ingress routes not ready`	Ingress routes weren't ready and the replica failed
`Deployment Progress Deadline Exceeded. 0/1 replicas ready.`	Progress deadline exceeded. A startup or health-check problem
`Requests return status 403`	Access denied. Suspect the network configuration
`Error fetching scaler metrics`	Can't reach the scaling signal source
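
The table above amounts to a lookup, which can be sketched in a few lines. This is a minimal illustration, not part of any Azure SDK: the helper name `classify_system_log` and the diagnosis labels are hypothetical, and matching is a plain case-insensitive substring test against the messages listed in the table.

```python
# Hypothetical triage helper: maps a system-log line to the cause it suggests.
# The message fragments come from the table above; the labels are illustrative.
SYSTEM_LOG_HINTS = [
    ("ErrImagePull", "image pull failed"),
    ("ContainerCrashing", "container crashes repeatedly"),
    ("Deployment Progress Deadline Exceeded", "startup or health-probe problem"),
    ("ingress routes not ready", "ingress routes not ready"),
    ("status 403", "access denied, check network configuration"),
    ("Error fetching scaler metrics", "scaling signal source unreachable"),
    ("Time-out", "startup too long or problem inside the container"),
]


def classify_system_log(line: str) -> str:
    """Return a short diagnosis for a system-log line, or 'unknown'."""
    lowered = line.lower()
    for fragment, diagnosis in SYSTEM_LOG_HINTS:
        if fragment.lower() in lowered:
            return diagnosis
    return "unknown"
```

Feeding it the tooltip text from a degraded revision, for example, returns the probe-related diagnosis, which tells you which symptom section to read next.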

Symptom ①: exit code 137 (OOMKilled)

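The most common cause of "suddenly crashes." Exit code 137 is 128 + 9, meaning the process was ended by SIGKILL; in a container this usually means it exceeded its memory limit and was killed by the Linux kernel's OOM killer.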

The Container Apps runtime can kill container that runs out of memory or CPU resources. Check system logs for out of memory (OOM) errors or throttling. (— troubleshoot-container-start-failures)
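
An exit code can be decoded mechanically. The sketch below assumes the usual convention that a code above 128 is 128 plus the number of the terminating signal; `describe_exit_code` is a hypothetical helper name, and the signal names are those of the platform the code runs on (Linux here). SIGKILL alone does not prove an OOM kill, so the system log remains the confirmation.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application error"
```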

Fixes:

Raise the memory limit. On the Consumption plan, CPU and memory are allocated as fixed pairs (memory in Gi = vCPU × 2), so raising memory also means raising vCPU.
Investigate the memory leak. If usage climbs gradually until the container dies with 137, raising the limit only postpones the crash. Look at the slope of memory usage in Azure Monitor metrics: a steady climb suggests a leak.
Move the in-memory cache outside the container. Per the principle of "hold no state in the container", offload a large cache to Redis.
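
To make "look at the slope" concrete, here is a minimal sketch that fits a least-squares line through memory samples and estimates how long until the limit is reached. The function names and the sample format (minutes, MiB) are assumptions for illustration; the samples would come from whatever you export from Azure Monitor.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (minutes since start, memory in MiB)


def memory_slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of the samples, in MiB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    return sum((t - mean_t) * (m - mean_m) for t, m in samples) / denom


def minutes_until_limit(samples: Sequence[Sample], limit_mib: float) -> Optional[float]:
    """Estimate minutes until the limit at the current slope; None if not climbing."""
    slope = memory_slope(samples)
    if slope <= 0:
        return None
    last_mib = samples[-1][1]
    return max(0.0, (limit_mib - last_mib) / slope)
```

A positive, steady slope across restarts points toward a leak; a flat line with occasional spikes points toward an undersized limit instead.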

Official guidance: If an app tends to experience out of memory errors or get killed, increase its memory limit or investigate memory leaks in the application.

Symptom ②: the container exits immediately (exit code 0 / nonzero)

The container ends right after starting. The causes fall into two families.

Nonzero exit = an app crash

If the container exited with a nonzero exit code, then an unhandled exception or error in your application code is often the problem. (— troubleshoot-container-start-failures)

Typical causes are an unhandled exception, a missing dependency, or a misconfiguration. Read the exception logged just before the crash in the app log. Reproducing locally with `docker run --rm <IMAGE>` is the shortest path.

An exit-code-0 immediate exit = entrypoint/CMD

If you overwrote the startup command or if your Dockerfile's entrypoint isn't launching the app correctly, the container could start and then immediately exit.

The startup command ends without launching the service. For Node.js, confirm that CMD starts the server; for Python, confirm that it starts a long-running process rather than a script that returns. You can fix command/args in the portal's Edit container.
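
As an illustration of what a long-running process means for Python, here is a minimal sketch using only the standard library. The `PORT` environment variable, the default of 8080, and the names `HealthHandler` and `build_server` are assumptions for this example, not something ACA requires.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with a small 200 response."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def build_server(port=None):
    """Bind on all interfaces; the port falls back to PORT or 8080 (assumed names)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

An entrypoint that calls `build_server().serve_forever()` blocks and keeps the container alive. A script that only builds the server and returns would exit with code 0, which is exactly the symptom described here.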

Symptom ③: stuck at Provisioning/Processing, or Degraded

"Never finishes starting" or "Degraded after 10+ minutes" is mostly a probe misconfiguration.

Revision is degraded: A new revision takes more than 10 minutes to provision. It finally has a Provision status of Provisioned, but a Running status of Degraded. The Running status tooltip reads Details: Deployment Progress Deadline Exceeded. 0/1 replicas ready. (— troubleshooting)

Causes and fixes:

Target-port mismatch. The health-probe (TCP) port, the app's listening port, and the Ingress target port must all match. This is the most frequent cause.
Slow startup (Java/JVM, etc.). Extend initialDelaySeconds. The default startup probe has failureThreshold: 240, which at a 1-second period gives about 240 seconds of grace; if that's still not enough, customize the probe.
An HTTP probe on a worker that doesn't speak HTTP. Disable Ingress (or make it internal) and remove the unnecessary HTTP probe.
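
A TCP probe only checks that a connection to the port is accepted, so a port mismatch can be confirmed with a few lines. This is a minimal sketch meant to be run against a locally started container (for example, one whose port is published with `docker run -p`); `port_accepts_tcp` is a hypothetical helper name.

```python
import socket


def port_accepts_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If this returns False for the value configured as the Ingress target port but True for another port, the app is listening somewhere other than where the probe looks.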

Reference: the default probes when Ingress is enabled (official).

Property	Startup	Readiness
Protocol	TCP	TCP
Port	Ingress target port	Ingress target port
Initial delay	1 sec	3 sec
Failure threshold	240	48
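
The grace period implied by these defaults is simple arithmetic: initial delay plus period times failure threshold. The table does not list the probe period, so the periods used below (1 second for startup, 5 seconds for readiness) are assumptions based on my reading of the documented defaults and should be checked against the current health-probe documentation.

```python
def probe_grace_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Rough upper bound, in seconds, before a probe is considered failed for good."""
    return initial_delay + period * failure_threshold


# Assumed periods: 1 s (startup) and 5 s (readiness); other values are from the table.
startup_grace = probe_grace_seconds(initial_delay=1, period=1, failure_threshold=240)
readiness_grace = probe_grace_seconds(initial_delay=3, period=5, failure_threshold=48)
```

Under these assumptions both probes tolerate roughly four minutes, so an app that needs longer than that to open its port needs a custom probe.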

Symptom ④: image-pull failure (ErrImagePull)

An image pull failure occurs when the Azure platform is unable to download (or pull) the container image that the application requires. (— troubleshoot-image-pull-failures)

Main causes and fixes:

Cause	Fix
Wrong image name/tag	Confirm the exact name/tag. Verify locally with `docker pull <IMAGE>`
Missing credentials (private registry)	Check that the credentials are correct and not expired. For ACR, check that the `AcrPull` role is assigned to the managed identity
Network blocking	Can the environment reach the registry FQDN? Check DNS, NSG, firewall
Rate limiting (Docker Hub, etc.)	Wait a few minutes. Use a registry with ample quota like ACR

An important trap caused by an optimization: "Image pull validation is only performed when the container image reference changes during an update. When you update your container app with the same image reference, the validation process is skipped." In other words, pushing a new image under an existing tag and then updating does not re-validate the pull. To force validation, publish under a new tag and update to it, which is one more reason not to overwrite latest.
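
Checking an image reference for a missing or mutable tag can be automated before deploying. The sketch below is a simplified parser, not Docker's full reference grammar: it treats the first path segment as a registry only if it contains a dot, a colon, or is `localhost`, and otherwise assumes Docker Hub. The function names and the registry name in the usage note are hypothetical.

```python
from typing import Optional, Tuple


def split_image_reference(ref: str) -> Tuple[str, str, Optional[str], Optional[str]]:
    """Split an image reference into (registry, repository, tag, digest)."""
    name, _, digest = ref.partition("@")
    last_slash = name.rfind("/")
    last_colon = name.rfind(":")
    # A colon counts as a tag separator only if it comes after the last slash;
    # otherwise it belongs to a registry port such as localhost:5000.
    if last_colon > last_slash:
        repo_path, tag = name[:last_colon], name[last_colon + 1:]
    else:
        repo_path, tag = name, None
    first, sep, rest = repo_path.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, repository = first, rest
    else:
        registry, repository = "docker.io", repo_path
    return registry, repository, tag, digest or None


def is_mutable_reference(ref: str) -> bool:
    """True if the reference has no digest and no tag, or uses the latest tag."""
    _, _, tag, digest = split_image_reference(ref)
    return digest is None and tag in (None, "latest")
```

For a reference such as `myregistry.azurecr.io/api:1.4.2`, `is_mutable_reference` returns False, while `nginx` or `nginx:latest` returns True and is worth replacing with a unique tag per build.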

Symptom ⑤: can't connect to the endpoint / 403

The app runs but can't be reached from outside. Check the Ingress settings (network design).

Check	Action
Is Ingress enabled	Check Enabled
Want to expose externally	Set Ingress Traffic to Accepting traffic from anywhere. If the app doesn't listen for HTTP traffic, set it to Limited to Container Apps Environment
Protocol	Is Ingress type HTTP/TCP correct
Target port	Does it match the app's listening port
IP restriction	Is the client IP denied in IP Security Restrictions

A 403 points to the network configuration: an IP restriction, or Ingress mistakenly limited to internal traffic.
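
Whether a given client IP falls inside a restriction rule can be checked offline with the standard `ipaddress` module. The sketch assumes a simplified model of IP Security Restrictions in which all rules share one action: with Allow rules, anything unmatched is denied, and with Deny rules, anything unmatched is allowed. Verify that model against the current documentation; `is_client_allowed` is a hypothetical helper name.

```python
import ipaddress
from typing import Sequence


def is_client_allowed(client_ip: str, cidr_rules: Sequence[str], action: str) -> bool:
    """Evaluate a client IP against CIDR rules that all share one action."""
    if not cidr_rules:
        return True  # no restrictions configured
    addr = ipaddress.ip_address(client_ip)
    matched = any(
        addr in ipaddress.ip_network(cidr, strict=False) for cidr in cidr_rules
    )
    if action == "Allow":
        return matched
    if action == "Deny":
        return not matched
    raise ValueError(f"unknown action: {action}")
```

A client whose public IP is missing from an Allow list evaluates to False here, which corresponds to the 403 seen from outside.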

Symptom ⑥: doesn't scale (Error fetching scaler metrics)

"The queue is piling up but replicas don't increase." If Error fetching scaler metrics appears in the system log, it's a sign the scaling signal source (DB, Event Hub, another app) can't be reached.

Ensure that your application can connect to the source of your scaling signal. (— troubleshoot-container-start-failures)

Check the VNet, DNS, firewall, and the scale rule's authentication (managed-identity role assignment). For scaling design itself, head to the KEDA autoscale guide.

A cross-cutting cause: DNS and networking

A considerable share of trouble in VNet-integrated environments comes down to DNS.

Azure recursive resolvers uses the IP address 168.63.129.16 to resolve requests. (— troubleshooting)

If you use custom DNS, forward unresolved queries to 168.63.129.16.
Don't block 168.63.129.16 in the NSG/firewall.
When using a managed identity, ensure reachability to login.microsoft.com (or <REGION>.login.microsoft.com).
Confirm that the private registry's FQDN can be resolved and reached.
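
Name resolution for the hosts in this checklist can be tested with a short script. This is a minimal sketch: it uses whatever resolver the machine is configured with (it does not query 168.63.129.16 directly), so the result is only meaningful when run from inside the same network as the environment. `resolve_host` is a hypothetical helper name.

```python
import socket
from typing import List


def resolve_host(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses for a hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty list for `login.microsoft.com` or for the registry FQDN narrows the problem to DNS before any firewall rule is examined.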

In an environment that restricts egress with UDR and a firewall, leaving these off the allowlist triggers a chain of failures: image-pull failure, scaler unreachable, and token-acquisition failure.

Can't delete the environment (ScheduledForDelete)

A VNet-integrated environment can stay at provisioningState: ScheduledForDelete instead of disappearing. The cause is that the associated VNet still exists.

# 1) Identify the subnet/VNet the environment uses
az containerapp env show --resource-group <RG> --name <ENV>   # check infrastructureSubnetId
# 2) Delete the VNet manually, then 3) delete the environment
az network vnet delete --resource-group <RG> --name <VNET>
az containerapp env delete --resource-group <RG> --name <ENV> --yes

If the CLI says "missing parameters"

If you get a mysterious "missing parameters" error in az containerapp, it's often just that the extension is old.

az extension add --name containerapp --upgrade
# If you use preview features
az extension add --name containerapp --upgrade --allow-preview true

Diagnosis flow (summary)

Won't start / crashes
  ├─ Check the Running status under Revisions and replicas
  │    Failed/Degraded → deploy failure
  ├─ Triage by the System log message
  │    ErrImagePull        → Symptom ④ (image pull)
  │    ContainerCrashing   → Symptom ② (exit code / app log)
  │    Deadline Exceeded   → Symptom ③ (probes / ports)
  │    403                 → Symptom ⑤ (Ingress / network)
  │    scaler metrics      → Symptom ⑥ (signal-source reachability / DNS)
  ├─ Read the exception in the Application log (stdout/stderr)
  ├─ Reproduce locally with docker run --rm <IMAGE>
  └─ Narrow down interactively with Diagnose and solve problems

Summary

For ACA trouble, the path from symptom to system-log message to cause is officially documented. The points to remember: 137 is OOM; an immediate exit with code 0 is the entrypoint; stuck at Provisioning or Degraded is probes and ports; ErrImagePull is the tag, AcrPull, or the network; can't connect or 403 is Ingress; and failure to scale in a VNet is DNS. Before redeploying blindly, read one line of the log. That is the fastest route to recovery.

For incident response in production operations, observability design, and recurrence prevention, contact me. For the full design picture, see the Azure Container Apps production-operations guide.
