友田 陽大
Azure Container Apps in production
Azure
Container Apps
Troubleshooting
Observability
Reliability
Debugging
Infrastructure
Containers

Azure Container Apps troubleshooting: diagnosing and fixing revision Failed/Degraded, exit code 137, probes, and image-pull failures

A systematic guide to diagnosing and fixing Azure Container Apps that won't start or keep crashing. It covers revision Failed/Degraded, exit code 137 (OOMKilled), immediate container exit, health-probe failures, image-pull failures, 403/unreachable endpoints, DNS (168.63.129.16), and unreachable scalers, organized by system-log message, with procedures that follow the official Microsoft Learn docs.

Published
Reading time
8 min read
Author
友田 陽大

"I deployed but the app won't start" and "it was running but suddenly crashes" are situations you will inevitably face in production. For Azure Container Apps (ACA), the path from symptom to cause is officially documented. Reading one log message is faster than redeploying repeatedly on a hunch.

This article summarizes diagnosis and fixes by symptom, following Microsoft Learn's three official troubleshooting guides (container start failures, image-pull failures, and general troubleshooting). For prerequisite knowledge of production operations, see the Azure Container Apps production-operations guide.


The entrance to diagnosis: three tools

Whatever happens, look at these three first.

  1. The Running status of revisions and replicas (portal → Application → Revisions and replicas). A status of Failed or Degraded means the deployment failed.
  2. Log stream (portal → Monitoring → Log stream, not Logs). Switch between Console (the app's stdout/stderr) and System (platform events).
  3. Diagnose and solve problems (interactive diagnosis with no setup, with categories such as Availability and Performance).

If the Log stream page says This revision is scaled to zero., select the Go to Revision Management button and deploy a new revision with a minimum replica count of 1. No logs appear while the app is scaled to zero, so redeploy with at least one replica and then observe.

System logs vs app logs

There are two kinds of logs, system and application logs. System logs show Azure Container Apps' platform activities, where application logs show what is logged to stdout and stderror. (— troubleshoot-container-start-failures)

Telling the two apart is the crux. Did the platform fail to start the container (system), or did the app crash with an exception (application)? Confirm the system-log message first.

| System-log message | Meaning |
| --- | --- |
| ErrImagePull | Image pull failed (image missing, wrong reference, or authentication) |
| Time-out | Startup takes too long, or there is a problem inside the container |
| ContainerCrashing | The container crashes repeatedly |
| ingress routes not ready | Ingress routes weren't ready and the replica failed |
| Deployment Progress Deadline Exceeded. 0/1 replicas ready. | Progress deadline exceeded; a startup or health-check problem |
| Requests return status 403 | Access denied; suspect the network configuration |
| Error fetching scaler metrics | Can't reach the scaling signal source |
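If you export system logs and want a first-pass triage, the table above can be turned into a small lookup. The following is only a sketch: the function name and the hint strings are hypothetical, and the message-to-symptom mapping simply mirrors the sections of this article rather than any Azure API.

```python
# Hypothetical helper: maps system-log messages from the table above to the
# symptom section of this article that covers them. Not an Azure API.
SYSTEM_LOG_HINTS = [
    ("ErrImagePull", "Symptom ④: image pull (name/tag, credentials, network)"),
    ("ContainerCrashing", "Symptom ②: read the exit code and the application log"),
    ("Deployment Progress Deadline Exceeded", "Symptom ③: probes or target port"),
    ("status 403", "Symptom ⑤: Ingress or network configuration"),
    ("Error fetching scaler metrics", "Symptom ⑥: scaling signal source unreachable"),
]


def triage_system_log(line: str) -> str:
    """Return a hint for the first known pattern found in a system-log line."""
    lowered = line.lower()
    for needle, hint in SYSTEM_LOG_HINTS:
        if needle.lower() in lowered:
            return hint
    # Nothing platform-side matched, so the application log is the next stop.
    return "No known pattern: check the application log (stdout/stderr)"
```

Matching is a case-insensitive substring test, so it tolerates extra context around the message, but it will not recognize messages that are not listed in the table.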

Symptom ①: exit code 137 (OOMKilled)

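The most common cause of "suddenly crashes." Exit code 137 means the process was terminated by SIGKILL (128 + 9), which in this context typically means the Linux kernel's OOM killer stopped the container for exceeding its memory limit.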

The Container Apps runtime can kill container that runs out of memory or CPU resources. Check system logs for out of memory (OOM) errors or throttling. (— troubleshoot-container-start-failures)

Fixes:

  1. Raise the memory limit. On the Consumption plan, ACA allocates CPU and memory as fixed pairs (memory in Gi = vCPU × 2), so raising memory also means raising vCPU.
  2. Investigate the memory leak. If usage climbs gradually and the container dies with 137, raising the limit only delays the crash. Look at the slope (a steady climb) of memory usage in Azure Monitor metrics.
  3. Move the in-memory cache outside the container. Following the principle of "hold no state in the container," offload a large cache to Redis.

Official guidance: If an app tends to experience out of memory errors or get killed, increase its memory limit or investigate memory leaks in the application.


Symptom ②: the container exits immediately (exit code 0 / nonzero)

The container ends right after starting. The causes fall into two families.

Nonzero exit = an app crash

If the container exited with a nonzero exit code, then an unhandled exception or error in your application code is often the problem. (— troubleshoot-container-start-failures)

Typical causes are an unhandled exception, a missing dependency, or a misconfiguration. Read the exception logged just before the crash in the application log. Reproducing it locally with docker run --rm <IMAGE> is the shortest path.

Immediate exit with code 0 = entrypoint/CMD

If you overwrote the startup command or if your Dockerfile's entrypoint isn't launching the app correctly, the container could start and then immediately exit.

The startup command finishes without launching the service. For Node.js, confirm that CMD starts the server; for Python, confirm that the command starts a long-running process rather than a script that returns. You can fix command/args in the portal's Edit container.
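The exit-code reasoning in this section and the previous one can be sketched as a small classifier. It relies on the common convention that a process killed by signal N is reported as exit code 128 + N; the function name and the wording of the hints are hypothetical.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Classify a container exit code along the lines used in this article."""
    if code == 0:
        # The command ran to completion, so nothing kept the container alive.
        return "exit 0: command finished; check that entrypoint/CMD starts a long-running process"
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        if name == "SIGKILL":
            return f"exit {code}: SIGKILL; suspect OOM and check the system log"
        return f"exit {code}: terminated by {name}"
    return f"exit {code}: application error; read the exception in the application log"
```

For example, describe_exit_code(143) reports SIGTERM, which usually indicates an orderly shutdown request rather than a crash.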


Symptom ③: stuck at Provisioning/Processing, or Degraded

"Never finishes starting" or "Degraded after 10+ minutes" is mostly a probe misconfiguration.

Revision is degraded: A new revision takes more than 10 minutes to provision. It finally has a Provision status of Provisioned, but a Running status of Degraded. The Running status tooltip reads Details: Deployment Progress Deadline Exceeded. 0/1 replicas ready. (— troubleshooting)

Causes and fixes:

  1. Target-port mismatch. Does the health-probe (TCP) port match both the app's listening port and the Ingress target port? This is the most frequent cause.
  2. Slow startup (Java/JVM, etc.). Extend initialDelaySeconds. The default startup probe has failureThreshold: 240 (a 240-second grace period); if that still isn't enough, customize the probe.
  3. An HTTP probe on a worker that doesn't speak HTTP. Disable Ingress (or make it internal) and remove the unnecessary HTTP probe.

Reference: the default probes when Ingress is enabled (official).

| Property | Startup | Readiness |
| --- | --- | --- |
| Protocol | TCP | TCP |
| Port | Ingress target port | Ingress target port |
| Initial delay | 1 sec | 3 sec |
| Failure threshold | 240 | 48 |
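To check a port mismatch locally, it can help to mimic what a TCP probe checks: whether something accepts a connection on the target port. The sketch below is an approximation for local testing, not the platform's implementation. The function names are hypothetical, and the 1-second period is an assumption inferred from the 240-second grace figure above.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def wait_for_startup(host, port, failure_threshold=240, period_seconds=1.0):
    """Retry like a startup probe; return the successful attempt number or None."""
    for attempt in range(1, failure_threshold + 1):
        if tcp_probe(host, port):
            return attempt
        time.sleep(period_seconds)
    return None  # threshold exhausted, analogous to a Degraded revision
```

Run the container locally, then call wait_for_startup("127.0.0.1", <TARGET_PORT>) with the same port you configured as the Ingress target port. If it returns None while the app logs "listening on" a different port, you have found the mismatch.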

Symptom ④: image-pull failure (ErrImagePull)

An image pull failure occurs when the Azure platform is unable to download (or pull) the container image that the application requires. (— troubleshoot-image-pull-failures)

Main causes and fixes:

| Cause | Fix |
| --- | --- |
| Wrong image name/tag | Confirm the exact name and tag. Verify locally with docker pull <IMAGE> |
| Missing credentials (private registry) | Check that the credentials are correct and not expired. For ACR, check that the AcrPull role is assigned to the managed identity |
| Network blocking | Can the environment reach the registry FQDN? Check DNS, NSG, and firewall |
| Rate limiting (Docker Hub, etc.) | Wait a few minutes. Use a registry with ample quota, such as ACR |

An important trap caused by an optimization: per the official guide, "Image pull validation is only performed when the container image reference changes during an update. When you update your container app with the same image reference, the validation process is skipped." To re-validate, change the tag and then update, which is one more reason not to overwrite latest.
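One way to avoid this trap is to give every build a unique tag. The sketch below only models the rule quoted above as a plain string comparison and builds a tag from a version and a commit SHA. Both function names and the tag format are assumptions, and the platform's actual comparison may be more involved.

```python
def pull_validation_runs(current_image: str, new_image: str) -> bool:
    """Model of the documented rule: validation runs only if the reference changes."""
    return current_image != new_image


def immutable_image_ref(repository: str, version: str, commit_sha: str) -> str:
    """Build a reference that differs per commit (hypothetical tag format)."""
    return f"{repository}:{version}-{commit_sha[:7]}"
```

With this scheme, redeploying myacr.azurecr.io/api:latest over itself would skip validation, while moving from one commit-specific tag to the next always changes the reference. The registry name here is a placeholder.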


Symptom ⑤: can't connect to the endpoint / 403

The app runs but can't be reached from outside. Check the Ingress settings (network design).

| Check | Action |
| --- | --- |
| Is Ingress enabled? | Check Enabled |
| Do you want to expose the app externally? | Set Ingress Traffic to Accepting traffic from anywhere. If the app doesn't speak HTTP, use Limited to Container Apps Environment |
| Protocol | Is the Ingress type (HTTP/TCP) correct? |
| Target port | Does it match the app's listening port? |
| IP restriction | Is the client IP denied in IP Security Restrictions? |

A 403 points to the network configuration: an IP restriction, or an app that was mistakenly exposed as internal-only.
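When a 403 is suspected to come from IP restrictions, it can be quicker to test the client address against the rules offline. The sketch below assumes a simplified model: all rules share one action, Allow rules deny everything that doesn't match, Deny rules allow everything that doesn't match, and no rules means everything is allowed. The rule dictionary shape and the function name are hypothetical, so check the model against your actual configuration.

```python
import ipaddress


def is_client_allowed(client_ip: str, rules: list[dict]) -> bool:
    """rules: [{"action": "Allow" or "Deny", "range": "<CIDR>"}], one action only."""
    if not rules:
        return True  # assumed default: no restrictions configured
    actions = {rule["action"] for rule in rules}
    if len(actions) != 1:
        raise ValueError("mixed Allow and Deny rules are not modeled in this sketch")
    address = ipaddress.ip_address(client_ip)
    matched = any(
        address in ipaddress.ip_network(rule["range"], strict=False)
        for rule in rules
    )
    # Allow list: only matches pass. Deny list: only non-matches pass.
    return matched if actions == {"Allow"} else not matched
```

For example, with a single Allow rule for 203.0.113.0/24, a request from 198.51.100.7 is rejected under this model, which would surface as the 403 described above.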


Symptom ⑥: doesn't scale (Error fetching scaler metrics)

"The queue is piling up but replicas don't increase." If Error fetching scaler metrics appears in the system log, it's a sign the scaling signal source (DB, Event Hub, another app) can't be reached.

Ensure that your application can connect to the source of your scaling signal. (— troubleshoot-container-start-failures)

Check the VNet, DNS, firewall, and the scale rule's authentication (managed-identity role assignment). For scaling design itself, head to the KEDA autoscale guide.


A cross-cutting cause: DNS and networking

A considerable proportion of VNet-integrated-environment trouble is DNS.

Azure recursive resolvers uses the IP address 168.63.129.16 to resolve requests. (— troubleshooting)

  • If you use custom DNS, forward unresolved queries to 168.63.129.16.
  • Don't block 168.63.129.16 in the NSG/firewall.
  • When using a managed identity, ensure reachability to login.microsoft.com (or <REGION>.login.microsoft.com).
  • Confirm that the private registry's FQDN can be resolved and reached.

In an environment that restricts egress with UDR + Firewall, forgetting to add these to the allowlist causes image-pull failures, unreachable scalers, and token-acquisition failures to occur in a chain.
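A quick way to test the DNS items above from inside a container is to resolve each required hostname and see which ones come back empty. This is a minimal sketch with hypothetical function names. It checks name resolution only, not whether the firewall allows the connection.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted IP addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry's sockaddr tuple starts with the IP address.
    return sorted({info[4][0] for info in infos})


def dns_report(hostnames):
    """Map each hostname to its resolved addresses (empty list means failure)."""
    return {name: resolve(name) for name in hostnames}
```

Call dns_report with login.microsoft.com plus your own registry and scaling-source hostnames. Any entry with an empty list is a candidate for a custom-DNS forwarding or allowlist problem.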


Can't delete the environment (ScheduledForDelete)

A VNet-integrated environment can stay at provisioningState: ScheduledForDelete without disappearing. The cause is that the associated VNet still exists.

# 1) Identify the subnet/VNet that the environment uses
az containerapp env show --resource-group <RG> --name <ENV>   # check infrastructureSubnetId
# 2) Delete the VNet manually, then 3) delete the environment
az network vnet delete --resource-group <RG> --name <VNET>
az containerapp env delete --resource-group <RG> --name <ENV> --yes

If the CLI says "missing parameters"

If az containerapp returns an unexplained "missing parameters" error, the cause is often simply an outdated extension.

az extension add --name containerapp --upgrade
# If you use preview features
az extension add --name containerapp --upgrade --allow-preview true

Diagnosis flow (summary)

Won't start / crashes
  ├─ Check the Running status under Revisions and replicas
  │    Failed/Degraded → deployment failed
  ├─ Triage by the System log message
  │    ErrImagePull        → Symptom ④ (image pull)
  │    ContainerCrashing   → Symptom ② (exit code / application log)
  │    Deadline Exceeded   → Symptom ③ (probes / ports)
  │    403                 → Symptom ⑤ (Ingress / network)
  │    scaler metrics      → Symptom ⑥ (signal source reachability / DNS)
  ├─ Read the exception in the Application log (stdout/stderr)
  ├─ Reproduce locally with docker run --rm <IMAGE>
  └─ Narrow down interactively with Diagnose and solve problems

Summary

For ACA trouble, the path symptom → system-log message → cause is officially documented. The key mappings to remember: 137 is OOM; an immediate exit with code 0 is the entrypoint; stuck at Provisioning or Degraded is probes and ports; ErrImagePull is the tag, AcrPull, or the network; can't connect or 403 is Ingress; and failing to scale in a VNet is DNS. Before redeploying blindly, read one line of the log. That is the fastest route to recovery.

For incident response in production operations, observability design, and recurrence prevention, contact me. For the full design picture, see the Azure Container Apps production-operations guide.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
