"I deployed but the app won't start," "it was running but suddenly crashes" — a scene you always face in production operations. Azure Container Apps (ACA) has the path from symptom to cause officially organized. Reading one log message is faster than blindly repeating redeploys on a hunch.
This article summarizes diagnosis and fixes by symptom, following Microsoft Learn's three official troubleshooting guides (container start failures, image-pull failures, and general troubleshooting). For prerequisite knowledge of production operations, see the Azure Container Apps production-operations guide.
The entrance to diagnosis: three tools
Whatever happens, look at these three first.
- The Running status of revisions and replicas (portal → Application → Revisions and replicas). Failed or Degraded means the deployment failed.
- Log stream (portal → Monitoring → Log stream, not Logs). Switch between Console (the app's stdout/stderr) and System (platform events) as you read.
- Diagnose and solve problems (interactive diagnosis with no setup; categories like Availability and Performance).
If the Log stream page says "This revision is scaled to zero.", select the Go to Revision Management button and deploy a new revision scaled to a minimum replica count of 1. No logs appear while the app is scaled to zero, so redeploy with a minimum of 1 replica and then observe.
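As a rough CLI counterpart to the portal steps above, the sketch below keeps at least one replica running and then follows the log stream. The app and resource-group names passed in are whatever you use (my-app and my-rg in the usage note are hypothetical), and the AZ variable is only a convenience of this sketch so the command lines can be previewed with AZ=echo instead of being executed.

```shell
# Sketch only. Override AZ (for example AZ=echo) to print the commands
# instead of running them against Azure.
AZ="${AZ:-az}"

# Keep at least one replica so logs are produced even when the app is idle.
# $1 = container app name, $2 = resource group
keep_one_replica() {
  "$AZ" containerapp update --name "$1" --resource-group "$2" --min-replicas 1
}

# Follow the log stream. $3 selects the log type: "system" (platform events,
# the default here) or "console" (the app's stdout/stderr).
follow_logs() {
  "$AZ" containerapp logs show --name "$1" --resource-group "$2" \
    --type "${3:-system}" --follow
}
```

Typical use would be `keep_one_replica my-app my-rg` followed by `follow_logs my-app my-rg`, then `follow_logs my-app my-rg console` once the system log looks clean.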
System logs vs app logs
There are two kinds of logs, system and application logs. System logs show Azure Container Apps' platform activities, where application logs show what is logged to stdout and stderr. (— troubleshoot-container-start-failures)
Telling the two apart is the crux: did the platform fail to start the container (system), or did the app crash with an exception (application)? Check the system-log message first.
| System-log message | Meaning |
|---|---|
| ErrImagePull | Image pull failed (absent, wrong reference, auth) |
| Time-out | Startup too long / a problem inside the container |
| ContainerCrashing | The container crashes repeatedly |
| ingress routes not ready | Ingress routes weren't ready and the replica failed |
| Deployment Progress Deadline Exceeded. 0/1 replicas ready. | Progress deadline exceeded. A startup or health-check problem |
| Requests return status 403 | Access denied. Suspect the network configuration |
| Error fetching scaler metrics | Can't reach the scaling signal source |
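The table above can be turned into a small triage helper. This is a minimal sketch, assuming a plain substring match on a single log line is good enough; the function name and the wording of its output are made up for illustration.

```shell
# Map one system-log line to where this article suggests looking next.
# Matching is a plain substring check against the messages in the table.
classify_system_log() {
  case "$1" in
    *ErrImagePull*)                    echo "symptom 4: image pull" ;;
    *ContainerCrashing*)               echo "symptom 2: exit code / app log" ;;
    *"Deadline Exceeded"*)             echo "symptom 3: probes / ports" ;;
    *"status 403"*)                    echo "symptom 5: ingress / network" ;;
    *"Error fetching scaler metrics"*) echo "symptom 6: scaling signal source / DNS" ;;
    *Time-out*)                        echo "startup too long or a problem inside the container" ;;
    *"ingress routes not ready"*)      echo "ingress routes were not ready" ;;
    *)                                 echo "unclassified: read the application log" ;;
  esac
}
```

For example, `classify_system_log "Deployment Progress Deadline Exceeded. 0/1 replicas ready."` prints `symptom 3: probes / ports`.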
Symptom ①: exit code 137 (OOMKilled)
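This is the most common cause of "suddenly crashes". Exit code 137 is 128 + 9, meaning the process was terminated by SIGKILL; in a container this usually means the Linux kernel's OOM killer stepped in after the memory limit was exceeded.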
The Container Apps runtime can kill container that runs out of memory or CPU resources. Check system logs for out of memory (OOM) errors or throttling. (— troubleshoot-container-start-failures)
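The number 137 follows the usual shell convention that an exit code above 128 means "terminated by signal (code minus 128)". A minimal sketch that decodes an exit code along those lines; the function name and messages are illustrative only.

```shell
# Decode a container exit code using the "128 + signal number" convention.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: the command finished normally (check entrypoint/CMD)"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): likely OOM kill" ;;
      15) echo "signal 15 (SIGTERM): asked to stop" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "exit $code: application error"
  fi
}
```

`describe_exit_code 137` prints `signal 9 (SIGKILL): likely OOM kill`, which is the case this section is about.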
Fixes:
- Raise the memory limit. On the Consumption plan, CPU and memory are allocated as a fixed pair (memory in Gi = vCPU × 2), so raising memory also means raising vCPU.
- Investigate the memory leak. If usage climbs gradually and the container dies with 137, raising the limit only postpones the crash. Check whether memory usage rises steadily in Azure Monitor metrics.
- Move the in-memory cache outside the container. Following the principle of "hold no state in the container", offload a large cache to Redis.
Official guidance:
If an app tends to experience out of memory errors or get killed, increase its memory limit or investigate memory leaks in the application.
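To make the memory = vCPU × 2 pairing from the fixes above concrete, the sketch below picks the smallest pair that provides a given amount of memory. It assumes sizes move in steps of 0.25 vCPU / 0.5 Gi and does not check any upper limit; both the step size and the function name are assumptions of this example.

```shell
# Print the smallest "<vCPU> <memory>Gi" pair that provides at least $1 Gi,
# assuming steps of 0.25 vCPU / 0.5 Gi (memory = vCPU x 2).
pair_for_memory_gi() {
  awk -v m="$1" 'BEGIN {
    steps = int(m * 2)            # number of 0.5 Gi steps, rounded down
    if (steps < m * 2) steps++    # round up to the next full step
    if (steps < 1) steps = 1      # never go below the smallest pair
    printf "%.2f %.1fGi\n", steps * 0.25, steps * 0.5
  }'
}
```

`pair_for_memory_gi 3` prints `1.50 3.0Gi`, which would translate to something like `az containerapp update --name <APP> --resource-group <RG> --cpu 1.5 --memory 3.0Gi`.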
Symptom ②: the container exits immediately (exit code 0 / nonzero)
The container ends right after starting. The causes fall into two families.
Nonzero exit = an app crash
If the container exited with a nonzero exit code, then an unhandled exception or error in your application code is often the problem. (— troubleshoot-container-start-failures)
Typical causes are an unhandled exception, a missing dependency, or a misconfiguration. Read the exception logged just before the crash in the application log. Reproducing locally with docker run --rm <IMAGE> is the shortest path.
An exit-code-0 immediate exit = entrypoint/CMD
If you overwrote the startup command or if your Dockerfile's entrypoint isn't launching the app correctly, the container could start and then immediately exit.
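In this case the startup command finishes without leaving the service running in the foreground. For Node.js, confirm that CMD actually starts the server; for Python, confirm that the command runs a long-lived process rather than a script that returns. You can fix the command and args in the portal's Edit container.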
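The difference is easy to see in plain shell. In this illustrative sketch, sleep stands in for a real server process and both function names are made up: a startup command that pushes the server to the background returns at once with status 0, while one that runs it in the foreground only returns when the server exits.

```shell
# "sleep" stands in for a real server process in both functions.

# Anti-pattern: the server goes to the background, so the startup command
# itself returns immediately with status 0. In a container, the main process
# ending means the container stops.
start_in_background() {
  sleep 2 >/dev/null 2>&1 &
}

# Preferred: the server is the foreground process, so the startup command
# keeps running for as long as the server does.
start_in_foreground() {
  sleep 2
}
```

Running `start_in_background; echo $?` prints 0 straight away, which is the same pattern an exit-code-0 container shows in the logs.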
Symptom ③: stuck at Provisioning/Processing, or Degraded
"Never finishes starting" or "Degraded after 10+ minutes" is mostly a probe misconfiguration.
Revision is degraded: A new revision takes more than 10 minutes to provision. It finally has a Provision status of Provisioned, but a Running status of Degraded. The Running status tooltip reads Details: Deployment Progress Deadline Exceeded. 0/1 replicas ready. (— troubleshooting)
Causes and fixes:
- Target-port mismatch. Does the health-probe (TCP) port match both the app's listening port and the Ingress target port? This is the most frequent cause.
- Slow startup (Java/JVM, etc.). Extend initialDelaySeconds. The default startup probe has failureThreshold: 240 (a 240-second grace period); if that is still not enough, customize the probe.
- An HTTP probe on a worker that doesn't speak HTTP. Disable Ingress (or make it internal) and remove the unnecessary HTTP probe.
Reference: the default probes when Ingress is enabled (official).
| Property | Startup | Readiness |
|---|---|---|
| Protocol | TCP | TCP |
| Port | Ingress target port | Ingress target port |
| Initial delay | 1 sec | 3 sec |
| Failure threshold | 240 | 48 |
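How long a probe tolerates failures can be estimated as initial delay + failure threshold × period. The sketch below does that arithmetic; it ignores per-probe timeouts, and the period values in the usage note (1 second for startup, 5 seconds for readiness) are assumptions that do not appear in the table above, so check them against your own probe configuration.

```shell
# Approximate seconds before a probe gives up:
#   initial delay + failure threshold x period
# $1 = initial delay (s), $2 = failure threshold, $3 = period (s, assumed)
probe_grace_seconds() {
  initial_delay=$1
  failure_threshold=$2
  period=$3
  echo $((initial_delay + failure_threshold * period))
}
```

With those assumed periods, `probe_grace_seconds 1 240 1` prints 241 for the startup probe and `probe_grace_seconds 3 48 5` prints 243 for the readiness probe. An app that needs longer than that to open its port is a candidate for a custom probe.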
Symptom ④: image-pull failure (ErrImagePull)
An image pull failure occurs when the Azure platform is unable to download (or pull) the container image that the application requires. (— troubleshoot-image-pull-failures)
Main causes and fixes:
| Cause | Fix |
|---|---|
| Wrong image name/tag | Confirm the exact name/tag. Verify locally with docker pull <IMAGE> |
| Missing credentials (private registry) | Check that the credentials are correct and not expired. For ACR, check that the AcrPull role is assigned to the managed identity |
| Network blocking | Can the environment reach the registry FQDN? Check DNS, NSG, firewall |
| Rate limiting (Docker Hub, etc.) | Wait a few minutes. Use a registry with ample quota like ACR |
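For the ACR row in the table above, the role assignment can be sketched as follows. The two arguments are values you look up yourself: the principal ID of the app's managed identity and the resource ID of the registry. The AZ variable is only a convenience of this sketch so the command can be previewed with AZ=echo.

```shell
# Sketch only. Override AZ (for example AZ=echo) to preview the command.
AZ="${AZ:-az}"

# Grant the app's managed identity pull access to the registry.
# $1 = principal ID of the managed identity
# $2 = resource ID of the container registry (the assignment scope)
grant_acr_pull() {
  "$AZ" role assignment create --assignee "$1" --role AcrPull --scope "$2"
}
```

One way to obtain the inputs, assuming a system-assigned identity, is `az containerapp show --name <APP> --resource-group <RG> --query identity.principalId --output tsv` for the principal ID and `az acr show --name <ACR> --query id --output tsv` for the scope.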
An important optimization trap:
Image pull validation is only performed when the container image reference changes during an update. When you update your container app with the same image reference, the validation process is skipped. In practice, to have the pull re-validated you need to change the tag and update, which is one more reason not to keep overwriting latest.
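One simple way to make sure the image reference changes on every release is to generate the tag. This is a minimal sketch, assuming a UTC timestamp is unique enough for your release cadence (a git commit SHA would serve equally well); the function name and the repository in the usage note are hypothetical.

```shell
# Build an image reference whose tag changes on every call.
# $1 = repository, for example myregistry.azurecr.io/my-app (hypothetical)
image_ref() {
  printf '%s:%s\n' "$1" "$(date -u +%Y%m%d%H%M%S)"
}
```

After building and pushing under that tag, the update could look like `az containerapp update --name <APP> --resource-group <RG> --image <the generated reference>`, which changes the image reference and therefore triggers validation.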
Symptom ⑤: can't connect to the endpoint / 403
The app runs but can't be reached from outside. Check the Ingress settings (network design).
| Check | Action |
|---|---|
| Is Ingress enabled | Check Enabled |
| Want to expose externally | Set Ingress Traffic to Accepting traffic from anywhere. If it doesn't speak HTTP, Limited to Container Apps Environment |
| Protocol | Is Ingress type HTTP/TCP correct |
| Target port | Does it match the app's listening port |
| IP restriction | Is the client IP denied in IP Security Restrictions |
A 403 points to the network configuration: an IP restriction, or Ingress mistakenly set to internal-only.
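The same checks can be read from the CLI rather than the portal. The sketch below only wraps two read-only commands; the AZ variable is a convenience of this example so the command lines can be previewed with AZ=echo, and the field names mentioned in the comments (external, targetPort, transport) are what I would expect in the JSON output, so treat them as an assumption to verify.

```shell
# Sketch only. Override AZ (for example AZ=echo) to preview the commands.
AZ="${AZ:-az}"

# Show the ingress configuration. In the output, compare the external flag,
# targetPort, and transport with what the app actually listens on.
# $1 = container app name, $2 = resource group
show_ingress() {
  "$AZ" containerapp ingress show --name "$1" --resource-group "$2" --output json
}

# List IP security restrictions that could be rejecting the client.
list_ip_restrictions() {
  "$AZ" containerapp ingress access-restriction list --name "$1" --resource-group "$2"
}
```

If the target port turns out to be wrong, `az containerapp ingress update --name <APP> --resource-group <RG> --target-port <PORT>` is the usual way to correct it.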
Symptom ⑥: doesn't scale (Error fetching scaler metrics)
"The queue is piling up but replicas don't increase." If Error fetching scaler metrics appears in the system log, it's a sign the scaling signal source (DB, Event Hub, another app) can't be reached.
Ensure that your application can connect to the source of your scaling signal. (— troubleshoot-container-start-failures)
Check the VNet, DNS, firewall, and the scale rule's authentication (managed-identity role assignment). For scaling design itself, head to the KEDA autoscale guide.
A cross-cutting cause: DNS and networking
A considerable proportion of VNet-integrated-environment trouble is DNS.
Azure recursive resolvers uses the IP address 168.63.129.16 to resolve requests. (— troubleshooting)
- If you use custom DNS, forward unresolved queries to 168.63.129.16.
- Don't block 168.63.129.16 in the NSG/firewall.
- When using a managed identity, ensure reachability to login.microsoft.com (or <REGION>.login.microsoft.com).
- Make sure the private registry's FQDN can be resolved and reached.
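The checklist above can be rehearsed with a small script. This is a sketch under assumptions: it only tests name resolution (not reachability), it is meaningful only when run from a host inside the same VNet, it assumes nslookup is installed, and the function names are made up for this example.

```shell
# Print the FQDNs from the checklist.
# $1 = Azure region (for example japaneast), $2 = private registry FQDN
required_fqdns() {
  printf '%s\n' "login.microsoft.com" "$1.login.microsoft.com" "$2"
}

# Try to resolve each FQDN with the resolver configured on this host
# and report the result per name.
check_dns() {
  required_fqdns "$1" "$2" | while read -r host; do
    if nslookup "$host" >/dev/null 2>&1; then
      echo "ok   $host"
    else
      echo "FAIL $host"
    fi
  done
}
```

A call such as `check_dns japaneast myregistry.azurecr.io` (both values hypothetical) gives one line per name, which makes a missing allowlist entry or a broken DNS forwarder easier to spot.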
In an environment tightening egress with UDR + Firewall, forgetting to put these in the allowlist causes "image-pull failure," "scaler unreachable," and "token-acquisition failure" to happen in a chain.
Can't delete the environment (ScheduledForDelete)
A VNet-integrated environment can stay at provisioningState: ScheduledForDelete without ever disappearing. The cause is that the associated VNet still exists.
# 1) Identify the subnet/VNet the environment uses
az containerapp env show --resource-group <RG> --name <ENV> # check infrastructureSubnetId
# 2) Delete the VNet manually → 3) delete the environment
az network vnet delete --resource-group <RG> --name <VNET>
az containerapp env delete --resource-group <RG> --name <ENV> --yes
If the CLI says "missing parameters"
If you get a mysterious "missing parameters" error in az containerapp, it's often just that the extension is old.
az extension add --name containerapp --upgrade
# If you use preview features
az extension add --name containerapp --upgrade --allow-preview true
Diagnosis flow (summary)
Won't start / crashes
├─ Check the Running status under Revisions and replicas
│ Failed/Degraded → deploy failure
├─ Triage by the System log message
│ ErrImagePull → Symptom ④ (image pull)
│ ContainerCrashing → Symptom ② (exit code / app log)
│ Deadline Exceeded → Symptom ③ (probes/ports)
│ 403 → Symptom ⑤ (Ingress/network)
│ scaler metrics → Symptom ⑥ (signal-source reachability/DNS)
├─ Read the exception in the Application log (stdout/stderr)
├─ Reproduce locally with docker run --rm <IMAGE>
└─ Narrow down interactively with Diagnose and solve problems
Summary
For ACA trouble, the path symptom → system-log message → cause is documented officially. The points to remember: 137 is OOM; an immediate exit with code 0 is the entrypoint; stuck at Provisioning or Degraded is probes and ports; ErrImagePull is the tag, AcrPull, or the network; can't connect or 403 is Ingress; and doesn't scale in a VNet is DNS. Before redeploying blindly, read one line of the log. That is the fastest route to recovery.
For incident response in production operations, observability design, and recurrence prevention, contact me. For the full design picture, see the Azure Container Apps production-operations guide.