# Complete UVR5 / audio-separator troubleshooting guide (GPU not used, CUDA, OOM, installation)

> 'The GPU isn't used and it's painfully slow,' 'CUDA out of memory,' 'cuDNN errors,' 'ffmpeg is missing,' 'the model is downloaded every time': this guide works through the problems people most often get stuck on when running source separation with UVR5 and audio-separator. It goes symptom by symptom, from diagnostic commands to concrete fixes, and is grounded in the official ONNX Runtime and PyTorch documentation.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: UVR5, audio-separator, GPU, CUDA, ONNX, source separation, Python, AI audio
- URL: https://tomodahinata.com/en/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- The most common problem is 'the GPU isn't used and it's painfully slow on CPU.' The usual cause is onnxruntime and onnxruntime-gpu installed side by side, or a CUDA/cuDNN mismatch. Diagnose first with onnxruntime.get_available_providers() and audio-separator --env_info.
- ONNX Runtime tries providers in priority order and silently falls back to CPU when CUDA can't be used. cuDNN 8.x and 9.x are incompatible with each other, and the PyPI default build targets CUDA 12.x from 1.19 onward. Match your versions against the table.
- For CUDA out of memory, the basic fix is lowering segment_size/batch_size. empty_cache() frees only unused cache; VRAM held by live tensors is not returned.
- ffmpeg is required for audio I/O. Models are auto-downloaded to /tmp/audio-separator-models/ on first use, so an ephemeral environment re-downloads them every time. Bake them into the image or persist the directory.
- To reinstall, uninstall torch and onnxruntime first, then install again. Don't mix CPU and GPU builds of either; that mix is the main cause of the silent CPU fallback.

---

## How to use this article

[UVR5 / MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) and [audio-separator](https://github.com/nomadkaraoke/python-audio-separator) are tools that **run instantly when the environment is right and can cost you hours when it isn't.** The sticking points are almost always the same: **GPU not used, CUDA out of memory, cuDNN errors, ffmpeg missing, and the model being downloaded every time.**

This article is a **troubleshooting reference you look up by symptom.** For each symptom it shows **what to run first to diagnose, the cause, and the concrete fix**, **based on the official documentation** for ONNX Runtime, PyTorch, and audio-separator.

> **About the author (reliability disclosure)**: I **single-handedly designed, built, and run in production an AI video-localization platform** whose first stage is source separation. The fixes in this article are a record of problems I **actually hit and fixed** in local, Colab, container, and production GPU environments. The production scaling design is covered in the [GPU-worker-foundation article](/blog/music-source-separation-production-api-gpu-worker-queue), and the overall picture of the tool in the [UVR5 guide](/blog/uvr5-mdx-net-vocal-separation-production-guide).

---

## Symptom quick reference (place a bet in 30 seconds)

| Symptom | Most likely cause | Immediate move |
| --- | --- | --- |
| **Painfully slow despite GPU** | coexistence of `onnxruntime` and `onnxruntime-gpu` / CUDA/cuDNN mismatch | [① CPU fallback](#most-common-gpu-not-used-painfully-slow-on-cpu) |
| **CUDA out of memory** | `segment_size` / `batch_size` too large, long clip | [② OOM](#cuda-out-of-memory-crashes) |
| **cuDNN / CUDA error** | ORT and CUDA/cuDNN version incompatibility | [③ version mismatch](#cudacudnn-version-errors) |
| **ffmpeg-related error** | ffmpeg not installed | [④ ffmpeg](#ffmpeg-not-found-audio-io-failure) |
| **Slow with model DL every time** | `/tmp` volatile, cache not persisted | [⑤ model DL](#the-model-is-downloaded-every-time-slow) |
| **Slow/won't work on Mac** | Apple Silicon doesn't support CUDA (MPS/CPU) | [⑥ Apple Silicon](#doesnt-work-or-is-slow-on-apple-silicon-mac) |

---

## First, diagnose: confirm the environment mechanically

Before fixing by intuition, **get the facts.** Run these commands first.

```bash
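# audio-separator's official environment diagnostic (run this first); reports CUDA and ffmpeg availability in one shot
audio-separator --env_info
# Expected output (example):
#   "ONNXruntime has CUDAExecutionProvider available, enabling acceleration"
#   "FFmpeg installed"

# Check that the GPU driver itself is visible
nvidia-smi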
```

Next, check **both ONNX Runtime and PyTorch** from Python. audio-separator runs on **two backends, ONNX Runtime (`.onnx` for MDX-Net, etc.) and PyTorch (`.ckpt` for Demucs, etc.)**, so you need to look at both.

```python
# diagnose.py: mechanically confirm whether the GPU is really in use
import onnxruntime as ort

print("ORT version :", ort.__version__)
print("ORT device  :", ort.get_device())                 # 'GPU' is good; 'CPU' means the GPU is unused
print("ORT providers:", ort.get_available_providers())   # should include 'CUDAExecutionProvider'

try:
    import torch
    print("torch       :", torch.__version__)
    print("torch.cuda  :", torch.cuda.is_available())     # PyTorch side (Demucs, etc.)
    if torch.cuda.is_available():
        print("device name :", torch.cuda.get_device_name(0))
except ImportError:
    print("torch not installed (not needed if you only use MDX-Net/.onnx)")
```

- If **`'CUDAExecutionProvider'` isn't in `ort.get_available_providers()`, the GPU isn't being used** (CPU execution = painfully slow).
- If `ort.get_device()` returns **`'CPU'`**, the same applies.

If either check says the GPU is not in use, the cause is almost certainly symptom ①.

---

## [Most common] GPU not used, painfully slow on CPU

This is **the most common problem and the hardest to notice.** **There is no error at all; processing is just tens of times slower.**

### Why it silently falls back to CPU

ONNX Runtime works through a **priority list of Execution Providers.** As the official documentation explains:

> *`['CUDAExecutionProvider', 'CPUExecutionProvider']` means execute a node using CUDAExecutionProvider if capable, otherwise execute using CPUExecutionProvider.*

In other words, **if CUDA can't be used, execution silently continues on CPU.** That is what is really behind "it works but is slow."
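One way to catch the fallback directly is to compare the providers you requested with the ones a session actually registered, which `InferenceSession.get_providers()` returns. The following is a minimal sketch under assumptions: the helper names `fell_back_to_cpu` and `active_providers` are hypothetical, the model path is whatever `.onnx` file you have at hand, and `onnxruntime` is imported lazily so the comparison helper can be used on its own.

```python
from typing import List, Sequence

CUDA_EP = "CUDAExecutionProvider"


def fell_back_to_cpu(requested: Sequence[str], active: Sequence[str]) -> bool:
    """True if CUDA was requested but is absent from the active providers."""
    return CUDA_EP in requested and CUDA_EP not in active


def active_providers(model_path: str, requested: Sequence[str]) -> List[str]:
    """Create a session and return the providers it actually registered."""
    import onnxruntime as ort  # lazy import; needs onnxruntime or onnxruntime-gpu

    session = ort.InferenceSession(model_path, providers=list(requested))
    return session.get_providers()


# Hypothetical usage (the model path is an assumption, not executed here):
# wanted = [CUDA_EP, "CPUExecutionProvider"]
# active = active_providers("/models/UVR-MDX-NET-Inst_HQ_3.onnx", wanted)
# print("CPU fallback:", fell_back_to_cpu(wanted, active))
```

The point of checking the session rather than only the installed build is that a build can list the CUDA provider while a particular session still ends up on CPU.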

### Cause A: coexistence of `onnxruntime` and `onnxruntime-gpu` (most likely)

If you install **both** the CPU package `onnxruntime` and the GPU package `onnxruntime-gpu`, they **unpack into the same directory and the last one installed overwrites the other**, which can silently remove `CUDAExecutionProvider`.

> 🔎 ONNX Runtime's official documentation **does not warn about this explicitly**; it is a **widely known community gotcha**, discussed for example in [microsoft/onnxruntime's issue](https://github.com/microsoft/onnxruntime/issues/7748). The resulting behavior (CPU fallback) follows from the providers spec.

**Fix**: remove the CPU version, then reinstall the GPU version.

```bash
pip uninstall -y onnxruntime          # always remove the CPU build first
pip install --force-reinstall onnxruntime-gpu
```
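Before uninstalling anything, you can confirm the coexistence from Python using only the standard library. This is a rough sketch: the function name `installed_ort_packages` is hypothetical, and the candidate list covers only the two packages discussed here.

```python
from importlib import metadata
from typing import Callable, Dict, Iterable

# The two distributions discussed above; both install the `onnxruntime` module.
ORT_DISTRIBUTIONS = ("onnxruntime", "onnxruntime-gpu")


def installed_ort_packages(
    candidates: Iterable[str] = ORT_DISTRIBUTIONS,
    version_of: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each installed ONNX Runtime distribution to its version."""
    found = {}
    for name in candidates:
        try:
            found[name] = version_of(name)
        except metadata.PackageNotFoundError:
            continue  # this distribution is not installed
    return found


found = installed_ort_packages()
if len(found) > 1:
    print("Conflict: CPU and GPU builds are both installed:", found)
else:
    print("ONNX Runtime distributions found:", found)
```

If both names show up, the uninstall-then-reinstall step above is the fix. A single `onnxruntime-gpu` entry is the state you want on a CUDA machine.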

### Cause B: CUDA / cuDNN version mismatch

If the CUDA/cuDNN versions that your `onnxruntime-gpu` build requires don't match what the environment provides, the CUDA EP can't load and execution falls back to CPU. → See [③ version mismatch](#cudacudnn-version-errors).

### audio-separator's official reinstall procedure

audio-separator's README shows a **clean reinstall procedure** for when the GPU doesn't work (verbatim).

```bash
pip uninstall torch onnxruntime
pip cache purge
pip install --force-reinstall torch torchvision torchaudio
pip install --force-reinstall onnxruntime-gpu
```

If `audio-separator --env_info` outputs `CUDAExecutionProvider available` after this, it's resolved.

---

## CUDA out of memory (crashes)

This is the case where **`RuntimeError: CUDA out of memory`** appears with a long clip or a high-resolution setting.

### The priority of fixes

1. **Lower `segment_size`** (most effective). The chunk put on the GPU becomes smaller, saving VRAM.
2. **Set `batch_size` to 1.** Throughput drops but it reliably reduces VRAM.
3. **Switch to a smaller/lighter model** (RoFormer → MDX-Net, etc. See the [model-selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)).
4. **Move up to a GPU with larger VRAM** (g4dn 16GB → g5/g6 24GB).

```python
# OOM mitigation: use smaller chunks to keep VRAM usage down
from audio_separator.separator import Separator

sep = Separator(mdx_params={
    "segment_size": 128,   # lowered from the default of 256 (the most effective OOM fix)
    "overlap": 0.25,
    "batch_size": 1,       # no batching, to save VRAM
    "hop_length": 1024,
    "enable_denoise": False,
})
sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")
```
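If you don't know in advance which `segment_size` fits, the first fix in the list can be automated as a retry loop that halves the value after an out-of-memory error. The sketch below is an illustration under assumptions, not audio-separator behavior: `halving_schedule`, `looks_like_oom`, and `run_with_backoff` are hypothetical names, the matched error strings are guesses at typical wording, and the lower bound of 32 is arbitrary.

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def halving_schedule(start: int, minimum: int) -> Iterator[int]:
    """Yield start, start/2, start/4, ... while the value stays >= minimum."""
    size = start
    while size >= minimum:
        yield size
        size //= 2


def looks_like_oom(exc: BaseException) -> bool:
    """Heuristic: match common out-of-memory wording (an assumption)."""
    text = str(exc).lower()
    return "out of memory" in text or "failed to allocate" in text


def run_with_backoff(job: Callable[[int], T], start: int = 256, minimum: int = 32) -> T:
    """Call job(segment_size); after an OOM-like error, retry with half the size."""
    last_error: Optional[BaseException] = None
    for size in halving_schedule(start, minimum):
        try:
            return job(size)
        except Exception as exc:
            if not looks_like_oom(exc):
                raise  # unrelated failure: do not hide it
            last_error = exc
            print(f"OOM at segment_size={size}, retrying with a smaller value")
    raise RuntimeError("out of memory at every segment_size tried") from last_error


# Hypothetical job (not executed here): build a Separator with the given size.
# def job(size):
#     sep = Separator(mdx_params={"segment_size": size, "overlap": 0.25,
#                                 "batch_size": 1, "hop_length": 1024,
#                                 "enable_denoise": False})
#     sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")
#     return sep.separate("input.wav")
```

Once a working value is found, pinning it in configuration is cheaper than rediscovering it on every run.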

### Beware the misconception about `empty_cache()`

When OOM appears on the PyTorch side (Demucs, etc.), it is tempting to call `torch.cuda.empty_cache()`, but **don't expect too much from it.** Per the official PyTorch documentation:

> *Calling empty_cache() releases all unused cached memory from PyTorch ... However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.*

In other words, **only "unused cache" is freed**, and **VRAM held by live tensors is not returned.** The real fix for OOM is to **reduce usage in the first place**, that is, make the chunk or batch smaller. If you suspect fragmentation, tuning the `PYTORCH_CUDA_ALLOC_CONF` environment variable is also an option.
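To see how little `empty_cache()` can give back in a given situation, compare the memory held by live tensors with the memory reserved by PyTorch's caching allocator. The sketch below is a rough illustration: `vram_report` is a hypothetical helper, and it returns `None` when PyTorch or CUDA is not available.

```python
from typing import Dict, Optional


def vram_report(device: int = 0) -> Optional[Dict[str, int]]:
    """Return allocator statistics in bytes, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device)  # held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # held by the caching allocator
    return {
        "allocated": allocated,
        "reserved": reserved,
        # Upper bound on what empty_cache() could release back to the driver.
        "reclaimable_at_most": reserved - allocated,
    }


print(vram_report())
```

If `allocated` is already close to the card's capacity, no amount of cache clearing helps, which is why shrinking the chunk or batch is the fix that works.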

---

## CUDA/cuDNN version errors

Typical symptoms: `LoadLibrary failed`, `libcudnn.so.X not found`, or the CUDA EP failing to load.

### ONNX Runtime's version requirements (official table)

Each ONNX Runtime release **requires specific CUDA/cuDNN versions** for its CUDA EP ([official table](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html)). Key points:

| ONNX Runtime | CUDA | cuDNN | Note |
| --- | --- | --- | --- |
| 1.20.x / 1.19.x | 12.x | **9.x** | PyPI default (**the PyPI default from 1.19 is CUDA 12.x**) |
| 1.18.1 | 12.x | 9.x | cuDNN 9 required |
| 1.18.0 | 12.x | 8.x | — |
| 1.20/1.19/1.18.x | 11.8 | 8.x | (the CUDA 11.8 builds of 1.19/1.20 aren't provided on PyPI) |

The biggest pitfall is this (quoted verbatim from the official documentation).

> *ONNX Runtime built with cuDNN 8.x is not compatible with cuDNN 9.x, and vice versa.*

**cuDNN 8.x and 9.x are not compatible.** A newer ORT (1.19+) requires cuDNN 9, so if the environment has only cuDNN 8, the CUDA EP can't load and it falls back to CPU.
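The table can also be encoded as a small lookup so that a startup script can state which cuDNN major version the installed ORT expects. This sketch covers only the rows shown above (1.18.x to 1.20.x) and returns `None` for anything else; `parse_version` and `required_cudnn_major` are hypothetical helper names, and the official table remains the source of truth.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a version string such as '1.19.2'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def required_cudnn_major(ort_version: str, cuda_major: int = 12) -> Optional[int]:
    """cuDNN major version per the table above; None if the table has no row."""
    version = parse_version(ort_version)
    if not (1, 18, 0) <= version < (1, 21, 0):
        return None  # outside the rows listed in this article
    if cuda_major == 11:
        return 8     # CUDA 11.8 builds of 1.18.x to 1.20.x use cuDNN 8.x
    if cuda_major == 12:
        return 8 if version == (1, 18, 0) else 9
    return None


print(required_cudnn_major("1.19.2"))  # 9
```

In a real check you would pass `onnxruntime.__version__` as the first argument and compare the result with the cuDNN actually present in the image.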

### Per-environment remedies

- **Colab** (when CUDA 12 is the default but ORT requires the CUDA 11 libraries, README verbatim):

```bash
apt update && apt install -y nvidia-cuda-toolkit   # supplies the missing CUDA libraries
```

- **ONNX Runtime for a CUDA 12 environment (the nightly build the README points to)**:

```bash
python -m pip install ort-nightly-gpu \
  --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/
```

- **The basic policy**: **match** the version of `onnxruntime-gpu` to the environment's CUDA/cuDNN **using the table.** If they can't be matched, either install the corresponding CUDA/cuDNN or switch to a different ORT version. audio-separator supports CUDA **11.8 and 12.2.**

---

## ffmpeg not found (audio I/O failure)

Typical symptoms: an `ffmpeg not found` message, or a crash when reading or writing audio. audio-separator **requires ffmpeg for audio I/O.**

```bash
# Debian / Ubuntu
apt-get update && apt-get install -y ffmpeg
# macOS
brew install ffmpeg
```

You're set if `audio-separator --env_info` outputs **`FFmpeg installed`.** In a container, **always bundle ffmpeg into the image** (installing it with apt at runtime is slow and unreliable).
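The same check can be made from Python at service startup, so that a missing binary shows up before the first audio file is processed. This is a minimal sketch using only the standard library; `ffmpeg_version_line` is a hypothetical helper name.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version_line(executable: str = "ffmpeg") -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if it isn't runnable."""
    path = shutil.which(executable)
    if path is None:
        return None  # not on PATH
    try:
        result = subprocess.run(
            [path, "-version"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # found on PATH but failed to run
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


line = ffmpeg_version_line()
print(line if line else "ffmpeg is missing: install it or add it to the image")
```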

---

## The model is downloaded every time (slow)

"A download of a few hundred MB runs on every startup": this is common with containers and serverless.

### The cause

The model is **auto-downloaded on first use**, and the default save location is **`/tmp/audio-separator-models/`** (per the README). Because **`/tmp` is ephemeral** in these environments, the model is re-downloaded on every cold start.

### The fix

- **Persist the save location**: point `model_file_dir` at a persistent volume.
- **Bake it into the image**: download once at build time and include it in a layer (fastest and most reliable in production).

```python
from audio_separator.separator import Separator

sep = Separator(model_file_dir="/models")   # persistent volume, or a path baked into the image
sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")
```

```dockerfile
# Dockerfile (excerpt): bake the model in at build time so startup never re-downloads it
ENV MODEL_DIR=/models
RUN python -c "from audio_separator.separator import Separator; \
  s=Separator(model_file_dir='${MODEL_DIR}'); \
  s.load_model(model_filename='UVR-MDX-NET-Inst_HQ_3.onnx')"
```

> Keeping the model resident and baking it into the image cuts both production cold-start time and egress charges at once. The scaling design is detailed in the [GPU-worker-foundation article](/blog/music-source-separation-production-api-gpu-worker-queue).
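To confirm that persistence or baking actually took effect, a startup check can verify that the expected model files are present before any job is accepted. The sketch below rests on assumptions: `missing_models` is a hypothetical helper, the `/models` path and file name mirror the examples above, and the library may keep additional metadata files next to the model that this check does not cover.

```python
from pathlib import Path
from typing import Iterable, List


def missing_models(model_dir: str, filenames: Iterable[str]) -> List[str]:
    """Return the model files that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in filenames:
        candidate = root / name
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(name)
    return missing


absent = missing_models("/models", ["UVR-MDX-NET-Inst_HQ_3.onnx"])
if absent:
    print("Not cached, will be downloaded on first use:", absent)
else:
    print("All models are already in place")
```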

---

## Doesn't work or is slow on Apple Silicon (Mac)

Apple Silicon (M1+) **doesn't support NVIDIA CUDA.** UVR5 itself supports **MPS (GPU) acceleration**, but with the `audio-separator` library, **CPU execution (the `[cpu]` extra) is the baseline.**

```bash
pip install "audio-separator[cpu]"   # CPU build for Apple Silicon
```

- Seeing "the GPU isn't used" on a Mac is **normal** (there is no CUDA). It runs on CPU but is **slow**, so for high-volume processing the realistic answer is a Linux environment with an NVIDIA GPU (cloud included).
- Choosing a light model (MDX-Net) and tuning `segment_size` to the machine gets it close to practical even on CPU.
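If the same scripts run on both Macs and Linux GPU hosts, it helps to detect Apple Silicon explicitly so that a missing CUDA provider is treated as expected rather than as a misconfiguration. This is a small sketch with a hypothetical helper name. Note that a Python interpreter running under Rosetta reports `x86_64`, so the check would return `False` there.

```python
import platform


def is_apple_silicon(system: str = "", machine: str = "") -> bool:
    """True on macOS running natively on an arm64 (M-series) CPU."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


if is_apple_silicon():
    print("Apple Silicon: no CUDA here, so a missing CUDAExecutionProvider is expected")
else:
    print("Not Apple Silicon: a missing CUDAExecutionProvider deserves a closer look")
```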

---

## Prevent "silent degradation" in production: a startup guard

Finally, a preventive measure so you **never hit the most common problem (①) again.** In production, **verify the GPU at startup and fail fast if execution has fallen back to CPU**, because running a slow production system without noticing is the most expensive outcome.

```python
# guard.py: verify at startup that CUDA is active; in production, abort startup on CPU fallback
import os
import onnxruntime as ort

def assert_gpu(require: bool = True) -> None:
    providers = ort.get_available_providers()
    ok = "CUDAExecutionProvider" in providers and ort.get_device() == "GPU"
    if not ok:
        msg = (f"GPU is not enabled (providers={providers}, "
               f"device={ort.get_device()}). Check onnxruntime-gpu and CUDA/cuDNN.")
        if require:
            raise RuntimeError(msg)     # guarantees that CI or production startup notices
        print(f"[WARN] {msg}")

if __name__ == "__main__":
    assert_gpu(require=os.environ.get("REQUIRE_GPU", "true").lower() == "true")
```

Wiring this into Docker's `HEALTHCHECK` or the service startup structurally prevents the worst accident: "**painfully slow on CPU without anyone noticing.**"

---

## Frequently asked questions (FAQ)

**Q. I have a GPU but `--env_info` doesn't output CUDAExecutionProvider.**
A. First, run `pip uninstall -y onnxruntime` to **remove the CPU version**, then reinstall `onnxruntime-gpu` (coexistence is the most likely cause). If that doesn't fix it, check ORT and CUDA/cuDNN against the [version table](#cudacudnn-version-errors) and suspect a cuDNN 8/9 mix-up.

**Q. The GPU works locally but it's CPU in a container.**
A. Confirm **whether the CUDA/cuDNN runtime is correctly installed** in the container, and whether `onnxruntime` (the CPU version) has slipped in. Also check whether `nvidia-smi` works inside the container (`--gpus all` / GPU runtime).

**Q. Can I fix `CUDA out of memory` with empty_cache?**
A. Basically no, because `empty_cache()` **only frees unused cache.** **Lowering `segment_size`/`batch_size`** is the right approach. If VRAM is still insufficient, move up to a larger GPU.

**Q. Why is ffmpeg needed?**
A. It's used for audio read/write (decode/encode). **Without it installed, it crashes on audio I/O.** Install it with `apt-get install -y ffmpeg`, etc.

**Q. I want to use the GPU on a Mac.**
A. Apple Silicon **doesn't support CUDA.** UVR5 itself supports MPS, but the library basically runs on CPU. For high-volume processing, use a Linux machine or cloud instance with an NVIDIA GPU.

---

## Conclusion: have the pattern of symptom → diagnose → fix

Troubleshooting UVR5 / audio-separator is **a swamp if you fix by guesswork**, but **once you collect the facts with diagnostic commands, the cause is almost always pinned down.**

1. **First, collect the GPU/ffmpeg facts** with `audio-separator --env_info` and `diagnose.py`.
2. **The most common problem is the CPU fallback**: remove the coexisting `onnxruntime` and match CUDA/cuDNN to the table.
3. **For OOM, lower the chunk/batch size** (`empty_cache` is not the real fix).
4. **In production, structurally prevent "silent degradation" with a startup GPU guard.**

> Avoiding these "environment swamps" **by design** is where bringing in outside help makes a difference. If you want to build voice or video AI that includes source separation at production quality, get in touch, and see the [case study](/case-studies/ai-video-localization-lipsync). As **one person working with generative AI**, I provide end-to-end support from environment setup to production operation.

---

## Sources / official resources

- **ONNX Runtime CUDA EP (requirement table, cuDNN compatibility)**: [CUDA Execution Provider](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html)
- **ONNX Runtime Python API**: [API summary (get_available_providers / get_device)](https://onnxruntime.ai/docs/api/python/api_summary.html) / [Install](https://onnxruntime.ai/docs/install/)
- **PyTorch CUDA memory management**: [CUDA semantics (empty_cache / PYTORCH_CUDA_ALLOC_CONF)](https://docs.pytorch.org/docs/stable/notes/cuda.html)
- **audio-separator**: [README (--env_info / reinstall procedure / CUDA 11.8 and 12.2 / model download location)](https://github.com/nomadkaraoke/python-audio-separator)
- **The "onnxruntime coexistence" issue**: [microsoft/onnxruntime issue #7748](https://github.com/microsoft/onnxruntime/issues/7748)

* Library version requirements change over time. **Always confirm primary sources before implementing.** The "coexistence of onnxruntime and onnxruntime-gpu" issue is not from official documentation; it is an explanation based on widely known behavior reported in the issue.