UVR5 (MDX-Net) Complete Guide: High-Accuracy Vocal/Accompaniment Separation and Production Automation, Grounded in Official Sources

The goal of this article

UVR5 (Ultimate Vocal Remover GUI) is a widely used open-source tool that separates a single audio track into vocals and accompaniment (instrumental). The official README describes its role as follows: "This application uses state-of-the-art source separation models to remove vocals from audio files."

MDX-Net is one of several AI architectures bundled with UVR5, and the one most commonly used for vocal/accompaniment separation.

This article sticks to official sources (the UVR5 GitHub repository, the MDX-Net paper arXiv:2111.12203, and python-audio-separator) and adds working code for what the official docs leave out: which situations call for the tool, how to use it, and where you get stuck. By the end, you should be able to do the following three things.

Explain to someone what UVR5 and MDX-Net are, and why they separate accurately.
Judge whether to choose the GUI (to try it out) or python-audio-separator (to automate), and get started today.
Build an implementation that holds up in production rather than only in a demo: OOM on long inputs, silent CPU fallback when the GPU is not picked up, and first-time model downloads.

About the author (disclosed for credibility): I single-handedly designed, implemented, and operate in production an AI video-localization platform that fully automates "audio separation → transcription → translation → multilingual dubbing → lip sync" from a single video upload. Its first stage is exactly this source separation. Separating voice from background music (BGM) before transcription and translation raises recognition accuracy, and keeping the accompaniment lets you lay the dubbed audio over it, so the quality of this preprocessing determines the quality of everything downstream. The "pitfalls" and "resilience design" in this article are not demo knowledge but a record of problems I actually hit in that operation. The downstream transcription is covered in the Whisper article, lip sync in the LatentSync article, and the whole pipeline in the localization-platform article.

A 30-second summary (conclusion first)

Aspect	Conclusion
What is it	OSS that separates a single audio track into vocals and accompaniment (and other stems). The GUI version is UVR5 v5.6 (MIT license)
What is MDX-Net	The core AI architecture UVR5 carries. A two-stream structure of frequency domain + time domain, strong at vocal/accompaniment separation
Academic basis	KUIELab-MDX-Net (arXiv:2111.12203). 2nd on Leaderboard A and 3rd on B in the ISMIR 2021 Music Demixing Challenge
Just to try	UVR5 GUI: Win/Mac/Linux. Separate by drag & drop. Usable even by non-engineers
To automate	python-audio-separator: `pip install "audio-separator[gpu]"`. Embeds in a CLI or server with about 3 lines using the `Separator` class
GPU needed	GTX 1060 6GB minimum, 8GB+ VRAM recommended (official README). Apple Silicon supports MPS
Output	By default 2 stems (Vocals / Instrumental), sample rate 44.1kHz
How to choose a model	Want accompaniment → Inst-series (`UVR-MDX-NET-Inst_HQ_3`) / want voice → Vocal-series (`Kim_Vocal_2`) / remove harmony → `UVR_MDXNET_KARA_2`
Main quality knobs	`segment_size` (large = fast, more VRAM) / `overlap` (reduces boundary artifacts) / `enable_denoise` (suppresses reverb noise, ~2× cost)
Suited use	Video-localization preprocessing, making karaoke/a-cappella tracks, podcast BGM removal, audio-dataset preprocessing
Unsuited use	Realtime/zero-latency processing, redistributing stems of copyrighted audio without permission

If you want to "first see the quality by drag & drop," jump to the GUI section; if you want to "embed it in a server," jump to the code section.

What do UVR5 / MDX-Net do?

The input is a single audio track (or a video's audio track), and the output is multiple separated stems. In its most basic usage, UVR5 separates audio into two stems: "vocals" and "instrumental (accompaniment)."

This is useful in situations like the following.

Preprocessing for video localization (dubbing): extract only the voice from the original video and route it to transcription/translation. Running ASR with background music mixed in lowers recognition accuracy; separating the voice first raises it. Keep the accompaniment track as well and remix it under the dubbed audio, so the dubbed version keeps the original BGM instead of sounding as if the music was swapped out.
Making karaoke tracks / a-cappella: extract only the accompaniment from a song for karaoke, or only the vocals for a-cappella / remix material.
Cleaning podcast / meeting audio: remove BGM or noise mixed into the recording to make the voice easier to hear.
Audio-dataset preprocessing: mass-produce clean audio with music and ambient sound removed for TTS / ASR training.

Conversely, it's not suited to zero-latency realtime processing (instant separation in a live stream, etc.). MDX-Net is batch-oriented: it processes a file as a whole and trades a fair amount of compute time for quality.

⚖️ A copyright caveat (important): the source-separation technology itself is neutral, but the audio you process almost always has copyright. Redistributing or publishing stems obtained by separating a commercial song without permission can be infringement. Even though UVR5's code is MIT-licensed, that's separate from "the rights to the audio you process." Use it within the scope of your own songs, licensed material, or private use.

How it works: why is MDX-Net accurate at separation? (paper-faithful, made easy)

This chapter explains the core of MDX-Net without sacrificing accuracy. If you only want the implementation, you can skip to the next chapter, but understanding this is what makes the shape of the later pitfalls make sense.

📌 A note for accuracy: UVR5's README has no text explaining "what MDX-Net is" (the carried architecture is only attributed in the credits to Kuielab & Woosung Choi). For the architecture's definition, I reference the following academic papers as primary sources.

The starting point: source separation means unmixing a summed waveform

A song's waveform is a single signal in which vocals, drums, bass, and other parts are summed. Source separation (music demixing) is the task of decomposing this sum back into its original components. There are broadly two approaches.

Frequency domain (time-frequency): convert the audio to a spectrogram (a time × frequency image) via STFT, and estimate each component's mask with an image-processing-like U-Net.
Time domain: process the waveform itself directly with a convolutional network (Demucs is representative).

Each has strengths and weaknesses, and either one alone has limits.

MDX-Net's answer: blend with a two-stream structure

MDX-Net originates from "KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing" (arXiv:2111.12203, Minseok Kim, Woosung Choi, et al.) by KUIELab at Korea University. The paper's abstract states the design philosophy concisely.

The proposed model has a time-frequency branch and a time-domain branch, where each branch separates stems, respectively. It blends results from two streams to generate the final estimation.

In other words:

Frequency-domain branch: estimate stems with TFC-TDF-U-Net v2 (the lineage described later) that processes the spectrogram.
Time-domain branch: use a pretrained Demucs (without additional training) to estimate from the waveform side too.
Blend: integrate the two estimates with a weighted average for the final output.

The frequency domain is good at "the outline of components" and the time domain at "the texture of the waveform." Combining both yields more consistently high quality than either model alone, and this is where MDX-Net's strength comes from.
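
The blend step can be pictured with a minimal sketch. It assumes both branches have already produced an estimate of the same stem as equal-length sample sequences. The function name `blend_estimates` and the default weight of 0.5 are illustrative assumptions, not the weights used in the paper.

```python
from typing import Sequence


def blend_estimates(
    freq_est: Sequence[float],
    time_est: Sequence[float],
    freq_weight: float = 0.5,  # hypothetical value; the paper's actual weights are not reproduced here
) -> list:
    """Weighted average of two estimates of the same stem, sample by sample."""
    if len(freq_est) != len(time_est):
        raise ValueError("both estimates must have the same number of samples")
    if not 0.0 <= freq_weight <= 1.0:
        raise ValueError("freq_weight must be between 0 and 1")
    time_weight = 1.0 - freq_weight
    return [freq_weight * f + time_weight * t for f, t in zip(freq_est, time_est)]
```

With `freq_weight=1.0` the result is the frequency-branch estimate alone, and with `0.0` the time-branch estimate alone; values in between trade one branch's errors against the other's.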

Track record: Sony/AIcrowd's Music Demixing Challenge 2021

KUIELab-MDX-Net placed 2nd on Leaderboard A and 3rd on Leaderboard B in the ISMIR 2021 Music Demixing (MDX) Challenge hosted by Sony and AIcrowd (from the paper's abstract).

⚠️ For accuracy: this is "2nd/3rd," not a win. Some online explanations write "won," but this article follows the primary source (the paper's abstract). Even so, the ranking was achieved under limited training-data conditions, and the result is generally regarded as high enough for practical use.

Lineage: the TFC-TDF block (ISMIR 2020)

The foundation of MDX-Net's frequency-domain branch is in the same lab's prior work "Investigating U-Nets with various Intermediate Blocks for Spectrogram-based Singing Voice Separation" (arXiv:1912.02591, ISMIR 2020, Woosung Choi et al.). The TFC-TDF block (Time-Frequency Convolutions + Time-Distributed Fully-connected) proposed there yielded high separation performance as the intermediate block of a U-Net handling complex spectrograms.

To organize, the lineage is this.

TFC-TDF block / U-Net                  (ISMIR 2020, arXiv:1912.02591)
        ↓ evolves
KUIELab-MDX-Net, with TFC-TDF-U-Net v2 built in
(two-stream + Demucs)                  (MDX Challenge 2021, arXiv:2111.12203)
        ↓ implemented in UVR5 & models distributed
UVR-MDX-NET / Kim_Vocal / KARA …       (bundled in Anjok07/ultimatevocalremovergui)

This mechanism decides the "shape of the pitfalls"

Once you grasp the mechanism, the troubles described later look inevitable rather than random.

Frequency-domain processing is loaded onto the GPU in chunks of STFT frames (segment_size), so this setting directly governs VRAM use and speed.
Seam artifacts appear at chunk boundaries, so overlap makes adjacent chunks overlap to mitigate them (at the cost of speed).
Inference runs on ONNX Runtime (the .onnx model), so if the CUDA/cuDNN versions don't match, it silently falls back to CPU and becomes extremely slow.

In other words, the parameters aren't knobs to turn "by feel"; each one follows directly from this design.
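
To make the chunk/overlap trade-off concrete, here is a minimal sketch that only computes where overlapping chunks would start and end. It illustrates the idea and is not audio-separator's internal slicing or cross-fading logic; the function name `chunk_spans` and the frame counts are assumptions for illustration.

```python
def chunk_spans(total: int, chunk: int, overlap: float) -> list:
    """Return (start, end) index pairs that cover `total` frames with overlapping chunks."""
    if chunk <= 0 or total < 0:
        raise ValueError("chunk must be positive and total non-negative")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(chunk * (1.0 - overlap)))  # how far each new chunk advances
    spans = []
    start = 0
    while start < total:
        end = min(start + chunk, total)
        spans.append((start, end))
        if end == total:  # the last chunk reached the end of the input
            break
        start += step
    return spans
```

For the same input length, a higher overlap produces more chunks, which is why raising it costs time; a larger chunk produces fewer chunks but each one needs more memory.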

How to Choose an Architecture: MDX-Net vs VR vs Demucs vs MDX23C

UVR5 carries multiple architectures besides MDX-Net (the original authors are clearly noted in the README credits). Choosing by purpose is correct; there's no single all-purpose one.

Architecture	Method	Original author (README)	Good at	Format
MDX-Net	Frequency + time two-stream	Kuielab & Woosung Choi	2-way vocal/accompaniment separation, overall balance of speed and quality	`.onnx`
VR Architecture	Spectrogram U-Net	tsurumeso	Reverb/echo removal, fine control like TTA/aggression	`.pth`
Demucs (v4 htdemucs)	Time domain	Adefossez & Demucs (Meta)	4-stem separation (vocals/drums/bass/other), Apple Silicon support	`.ckpt`
MDX23C / MDXC	MDX's successor, higher quality	ZFTurbo	Latest-generation high-quality separation	`.ckpt`

The choice is simple.

You just want two stems, vocals and accompaniment → MDX-Net. The most proven option, with a good balance of speed and quality, and the focus of this article.
You want 4 stems including drums, bass, and other (for remixing, music transcription, or learning a song by ear) → Demucs v4 (htdemucs). It also runs on Apple Silicon's MPS. Standalone Demucs production operation is covered in detail in the dedicated guide.
You want to remove reverb/echo → VR Architecture's dedicated models.
You want to try the latest, highest quality → MDX23C/MDXC, or an ensemble of multiple models.

💡 Ensemble as a trump card: UVR5 has an Ensemble mode that integrates multiple models' outputs. Combining Kim_Vocal_2 (vocal-specialized) and UVR-MDX-NET-Inst_HQ_3 (accompaniment-specialized) tends to reduce residue versus standalone—but the processing time grows by the number of models, so it's a choice when quality is the top priority.

Usage A: First, Try It in the GUI (UVR5 itself)

At the stage of "I first want to check whether the quality holds up on my own audio," using the GUI without writing code is the shortest path. It is usable even by non-engineers.

Get the installer matching your OS from the official releases (Windows 10+ / macOS Big Sur+ / Linux). On Windows, installing directly under the C: drive is recommended.
After launch, select MDX-Net in Process Method.
Select a model like UVR-MDX-NET-Inst_HQ_3 (accompaniment extraction), and the model auto-downloads the first time.
Drag & drop an audio/video file, specify the output destination, and Start Processing.

Operation requirements (official README):

The GPU is NVIDIA GTX 1060 6GB as the minimum requirement, 8GB+ VRAM recommended.
AMD support is currently limited.
Apple Silicon (M1 and later) supports MPS (GPU) acceleration, usable with Demucs v4 and all MDX-Net models.
Intel Pentium / Celeron-series CPUs are not guaranteed to work.

The GUI is handy, but it's unsuited to batch processing, server embedding, or CI integration. "By hand, one file at a time, every time" breaks down in production. Do automation with the library in the next chapter.

Usage B: Automate with Code (python-audio-separator)

UVR5 itself is a GUI app, so you can't drive it directly from a script. The right approach for automation is python-audio-separator, a library that loads the same models (.onnx etc.) as UVR5 and runs them from the CLI or Python.

1. Install (faithful to the official procedure)

# GPU (NVIDIA CUDA). Use this in production
pip install "audio-separator[gpu]"

# CPU / Apple Silicon
pip install "audio-separator[cpu]"

# conda
conda install audio-separator -c pytorch -c conda-forge

Python 3.10 or higher is required.
Inference runs on ONNX Runtime (.onnx MDX-Net/VR models) + PyTorch (Demucs etc. .ckpt).
Supported CUDA is 11.8 and 12.2. When the GPU version "works but is terribly slow," it's almost always a mismatch of CUDA/cuDNN and onnxruntime-gpu (Pitfall 1 below).

2. Minimal code (works in 3 lines)

The official minimal example is this. Make a Separator, load a model, and call separate().

from audio_separator.separator import Separator

separator = Separator()                  # initialize with default settings
separator.load_model(                     # MDX-Net accompaniment-extraction model (auto-downloads on first use)
    model_filename="UVR-MDX-NET-Inst_HQ_3.onnx"
)
output_files = separator.separate("song.wav")
print(output_files)
# e.g. ['song_(Vocals)_UVR-MDX-NET-Inst_HQ_3.wav',
#       'song_(Instrumental)_UVR-MDX-NET-Inst_HQ_3.wav']

separate() returns a list of the generated output file paths. By default, the 2 stems of Vocals and Instrumental are output at 44.1kHz. The output format can be specified, and the library's default is WAV.
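
Because `separate()` returns a plain list, downstream code usually needs to pick one stem out of it. The following is a minimal sketch that assumes the `(Vocals)` / `(Instrumental)` naming pattern shown in the example output above; the helper name `pick_stem` is hypothetical, and the naming pattern should be confirmed against the version you install.

```python
from pathlib import Path


def pick_stem(output_files: list, stem: str) -> str:
    """Return the first output path whose file name contains '(<stem>)'."""
    marker = f"({stem})".lower()
    for path in output_files:
        if marker in Path(path).name.lower():
            return path
    raise LookupError(f"no output file for stem {stem!r} in {output_files!r}")
```

Failing loudly when the expected stem is missing is deliberate: indexing the list by position would silently pass the wrong track downstream if the order ever differs.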

⚠️ A default-value discrepancy (caution): the default of output_format differs—WAV in Python's Separator and FLAC in the CLI (--output_format). If you remix downstream, explicitly specify the lossless flac/wav to be safe (mp3 degrades).

3. CLI (for batch, shell, CI)

For one-off or batch execution on a server, the CLI is handy.

# Separate one file with the accompaniment-extraction model
audio-separator song.wav \
  --model_filename UVR-MDX-NET-Inst_HQ_3.onnx \
  --output_dir ./stems \
  --output_format flac

# List available models (the most reliable way to confirm the exact file names as of right now)
audio-separator --list_models --list_filter vocals

# Only the accompaniment is needed (output a single stem to reduce writes)
audio-separator song.wav \
  --model_filename UVR-MDX-NET-Inst_HQ_3.onnx \
  --single_stem Instrumental

🔧 A parameter whose name differs between the CLI and the library: the option to output only one stem is --single_stem in the CLI and output_single_stem in Python's Separator. Same function but different names, so beware of mix-ups.
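
One way to keep the two naming schemes from drifting apart is to hold the settings in one place and derive both forms from it. This is a minimal sketch: the `SeparationConfig` class is hypothetical, and it only uses the option names that appear in this article (`--model_filename`, `--output_dir`, `--output_format`, `--single_stem`, and the `output_dir`, `output_format`, `output_single_stem` keyword arguments).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SeparationConfig:
    model_filename: str
    output_dir: str = "stems"
    output_format: str = "flac"        # set explicitly so CLI and library defaults cannot diverge
    single_stem: Optional[str] = None  # e.g. "Vocals" or "Instrumental"

    def separator_kwargs(self) -> dict:
        """Keyword arguments for Separator(...); the model goes to load_model() separately."""
        kwargs = {"output_dir": self.output_dir, "output_format": self.output_format}
        if self.single_stem:
            kwargs["output_single_stem"] = self.single_stem  # library-side name
        return kwargs

    def cli_args(self, audio_path: str) -> list:
        """Argument list for the audio-separator CLI, suitable for subprocess.run."""
        args = [
            "audio-separator", audio_path,
            "--model_filename", self.model_filename,
            "--output_dir", self.output_dir,
            "--output_format", self.output_format,
        ]
        if self.single_stem:
            args += ["--single_stem", self.single_stem]      # CLI-side name
        return args
```

A job runner can then call `Separator(**cfg.separator_kwargs())` in-process or `subprocess.run(cfg.cli_args("song.wav"))` from a shell-style worker, and both paths produce the same format and stem selection.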

4. MDX-Net's parameters (defaults and meanings)

Control MDX-Net's behavior with Separator's mdx_params. The following are audio-separator's defaults.

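from audio_separator.separator import Separator

separator = Separator(
    output_dir="stems",
    output_format="flac",        # lossless; avoids quality loss downstream
    sample_rate=44100,           # default; CD quality
    mdx_params={
        "segment_size": 256,     # size of the chunk loaded onto the GPU. Larger = faster, more VRAM
        "overlap": 0.25,         # overlap between chunks. Larger = fewer seams, slower
        "batch_size": 1,         # number processed together. Larger = more throughput, more VRAM
        "hop_length": 1024,      # STFT hop size. Normally leave as-is
        "enable_denoise": False, # suppresses reverb noise via phase inversion. Roughly 2x cost
    },
)
separator.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")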

💡 These numbers are the library defaults. Descriptions of the internal behavior of enable_denoise and overlap that go beyond the official defaults are based on community knowledge (UVR Discussion #831). The rule of thumb is to run with the defaults first and change only what you're unhappy with.

Model-selection guide: reverse-lookup from your purpose

Choosing the model by which stem you want is the basis. "Using a vocal-specialized model when you want accompaniment" lowers quality. The following filenames were confirmed in UVR's official model_name_mapper.json (✅), but the distribution list is updated, so always confirm existence with audio-separator --list_models before going to production.

Purpose	Recommended model	Type	Status
Cleanly extract accompaniment (instrumental)	`UVR-MDX-NET-Inst_HQ_3.onnx`	Inst	✅ confirmed in official mapper
Accompaniment (older generation, for comparison, strong on long sustained tones)	`UVR-MDX-NET-Inst_HQ_2.onnx`	Inst	✅
Cleanly extract vocals (voice)	`Kim_Vocal_2.onnx`	Vocal	✅
Vocals (aiming for an even cleaner a-cappella)	`UVR-MDX-NET-Voc_FT.onnx`	Vocal	🟡 confirm with `--list_models`
Remove harmony/chorus, keep only the main voice	`UVR_MDXNET_KARA_2.onnx`	Karaoke	✅
General-purpose (standard)	`UVR_MDXNET_Main.onnx`	General	✅

The principles of choosing (community consensus, Discussion #444):

"Inst"-series are optimized to keep the accompaniment clean, and "Vocal"-series to extract the voice cleanly. Choose to match the stem you want, and that stem's quality goes up.
Vocal extraction: Kim_Vocal_2 is the standard, fast and high-quality.
Accompaniment extraction: UVR-MDX-NET-Inst_HQ_3 (the HQ-series) is the standard.
Harmony removal (keep only the lead vocal): UVR_MDXNET_KARA_2.
To aim for the highest quality, use an ensemble of a vocal-specialized and an accompaniment-specialized model (processing time grows accordingly).

📌 These recommendations are community consensus, not an official spec. Rather than relying on benchmark figures such as SDR, compare 2–3 models on your own audio before adopting one.
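
The reverse lookup can be written down as a small table in code. This is a sketch only: the purpose keys (`"instrumental"`, `"vocals"`, and so on) and the `pick_model` helper are hypothetical names, while the file names are the ones listed in the table above. The optional `available` argument stands for whatever list of file names you obtained from `audio-separator --list_models`; how you collect that list is left open here.

```python
from typing import Iterable, Optional

# Purpose → recommended model file name, taken from the table above.
RECOMMENDED_MODELS = {
    "instrumental": "UVR-MDX-NET-Inst_HQ_3.onnx",
    "vocals": "Kim_Vocal_2.onnx",
    "lead_vocal_only": "UVR_MDXNET_KARA_2.onnx",
    "general": "UVR_MDXNET_Main.onnx",
}


def pick_model(purpose: str, available: Optional[Iterable[str]] = None) -> str:
    """Return the recommended model for a purpose, optionally checking it still exists."""
    try:
        model = RECOMMENDED_MODELS[purpose]
    except KeyError:
        raise ValueError(f"unknown purpose {purpose!r}; expected one of {sorted(RECOMMENDED_MODELS)}")
    if available is not None and model not in set(available):
        raise LookupError(f"{model} is not in the current model list; re-check the distribution list")
    return model
```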

Practical parameter tuning: presets by use

The official docs show only the defaults. Here I add values that have worked in practice. The table below gives presets you can use as a starting point.

Use	`segment_size`	`overlap`	`enable_denoise`	Aim
Low VRAM (6–8GB)	128–256	0.25	off	Avoid OOM while keeping practical quality
Standard (12–16GB)	256	0.25	off	Default. Start here
Quality top priority (24GB+)	384–512	0.5	on	Minimize seams and noise
Speed priority for large batches	As large as possible	0.25	off	Maximize throughput
Seam noise bothers you	Leave as-is	↑ (0.5)	off	Mitigate boundary artifacts
Reverb / residual noise bothers you	Leave as-is	Leave as-is	on	Suppress via phase inversion (~2× time)

Rephrasing the knobs' meanings in practical terms:

segment_size (chunk size): the larger, the fewer GPU round-trips and faster, but it eats VRAM. If OOM occurs, lower this first. If VRAM has room, make it as large as allowed.
overlap (chunk overlap): raise it and the seam noise at chunk boundaries decreases, but it gets slower accordingly. Raise it only for material where seams bother you.
enable_denoise (noise suppression): infer on both the normal input and a phase-inverted input and average them, suppressing reverb and residual noise. The compute roughly doubles, so only for the final polish.
batch_size: raise it and throughput goes up, but VRAM grows. For one-off processing, 1 is fine.
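
The phase-inversion idea behind enable_denoise can be shown with a toy example. This sketch assumes the description above (two inferences, one on the sign-flipped input, then an average) and uses a stand-in `toy_model` whose error does not depend on the input's sign; it is not the library's implementation, and both function names are hypothetical.

```python
from typing import Callable, Sequence


def denoise_by_phase_inversion(model: Callable, x: Sequence[float]) -> list:
    """Run the model on x and on -x, flip the second result back, and average the two."""
    normal = model(list(x))
    inverted = model([-s for s in x])  # second forward pass: this is where the ~2x cost comes from
    return [0.5 * (n - i) for n, i in zip(normal, inverted)]


def toy_model(x: Sequence[float]) -> list:
    """Stand-in for a separation model: scales the input and adds a constant error of 0.1."""
    return [2.0 * s + 0.1 for s in x]
```

In this toy case the constant error appears with the same sign in both passes, so it cancels when the flipped result is averaged in, while the part of the output that follows the input's sign is preserved.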

Cost follows quality directly: the knobs that raise quality (overlap↑, denoise on) translate almost one-to-one into time, and therefore money. A two-tier approach, "run everything with the defaults, then reprocess only the unsatisfactory cuts/songs with high-quality settings," gives the best unit cost in production.
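
The preset table can be turned into a small helper that returns an `mdx_params` dict. This is a sketch under stated assumptions: the function name `mdx_params_for` is hypothetical, the table leaves gaps between tiers (8–12 GB and 16–24 GB), and the way those gaps are filled below is my own cautious choice rather than a documented rule.

```python
def mdx_params_for(vram_gb: float, quality_first: bool = False) -> dict:
    """Map available VRAM to an mdx_params dict, following the preset table."""
    params = {
        "segment_size": 256,
        "overlap": 0.25,
        "batch_size": 1,
        "hop_length": 1024,
        "enable_denoise": False,
    }
    if vram_gb < 12:
        # Assumption: anything below the "standard" tier starts at the cautious end (128).
        params["segment_size"] = 128
    elif vram_gb >= 24 and quality_first:
        # Quality-first tier from the table: larger chunks, more overlap, denoise on.
        params.update(segment_size=512, overlap=0.5, enable_denoise=True)
    return params
```

For the two-tier approach, a first pass would call `mdx_params_for(vram)` for every file and a second pass would call it with `quality_first=True` only for the files that failed review.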

Applications: where to use it in practice (going beyond the official)

The official docs stop at "how to separate." What follows is how the tool actually delivers value on a project.

① Video-localization preprocessing (transcribe after extracting the voice)

In a dubbing pipeline, transcribing audio with BGM still mixed in increases misrecognition. Separate only the voice first and pass it to Whisper, and recognition accuracy goes up. Also keep the accompaniment track and remix it under the dubbed audio, so the dubbed version keeps the original BGM.

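# Video-localization preprocessing: extract only the voice → ASR; keep the accompaniment for a later remix
from audio_separator.separator import Separator


def whisper_transcribe(path: str) -> str:
    """Placeholder for your own ASR call (hypothetical helper, not part of audio-separator)."""
    raise NotImplementedError


# Only the voice is needed, so use a vocal-specialized model with single-stem output
vocal_sep = Separator(output_dir="work", output_format="flac",
                      output_single_stem="Vocals")
vocal_sep.load_model(model_filename="Kim_Vocal_2.onnx")
vocals = vocal_sep.separate("episode.wav")        # voice-only track

transcript = whisper_transcribe(vocals[0])         # ASR is more stable with the accompaniment removed
# Extract the accompaniment separately with UVR-MDX-NET-Inst_HQ_3 and remix it with the dubbed audio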

② Karaoke tracks, a-cappella, remix material

The most straightforward use: take out only the accompaniment (karaoke) or only the vocals (a-cappella / sampling material) from a song. Take accompaniment with the Inst-series and voice with the Vocal-series, and if needed, split into 4 stems with Demucs v4 for remixing or covering.

③ Cleaning podcast / meeting audio

Preprocessing to remove BGM or TV audio mixed into a recording and make the voice easier to hear. Insert it before transcription/summarization and the downstream quality goes up.

④ Audio-dataset preprocessing (TTS / ASR training)

Mass-produce clean audio with music and ambient sound removed from training data. Since data quality decides model quality, the value of inserting source separation here is large.

The 5 pitfalls you'll surely hit in production, and resilience design

These problems don't occur in a demo (a short single file, a local GPU) but surface all at once in production (large batches, containers, long inputs). I show the causes and countermeasures in the order I hit them in real operation.

Pitfall 1: GPU does not work and falls back to CPU (most frequent, hardest to notice)

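Even with audio-separator[gpu] installed, if the versions of CUDA/cuDNN and onnxruntime-gpu don't match, ONNX Runtime silently falls back to CPU execution. No error appears; processing just gets dozens of times slower. This is the trap that is hardest to notice and most expensive in production.

Countermeasure: at startup, verify that CUDAExecutionProvider is available, and in production abort startup if it is not (fail-fast). Note that get_available_providers() reports which providers the installed onnxruntime build offers, so treat it as a necessary check rather than proof that a session actually runs on CUDA.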

import onnxruntime as ort

def verify_gpu(require: bool = True) -> None:
    """Verify at startup that CUDA is available. Falling back to CPU in production is extremely slow."""
    providers = ort.get_available_providers()
    if "CUDAExecutionProvider" not in providers:
        msg = f"CUDA not detected (available: {providers}). Check that onnxruntime-gpu matches CUDA/cuDNN."
        if require:
            raise RuntimeError(msg)   # avoid running a slow production service without noticing
        print(f"[WARN] {msg}")
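
A stricter check looks at the providers a live session ended up with, since a build can list CUDA as available and still fall back when the CUDA libraries fail to load. The sketch below keeps the decision logic free of onnxruntime so it can be tested anywhere; `require_cuda` is a hypothetical helper, and the usage in the trailing comment assumes you create a small `InferenceSession` yourself with a placeholder model path.

```python
from typing import Sequence


def cuda_is_active(active_providers: Sequence[str]) -> bool:
    """True when CUDA is among the providers a session actually registered."""
    return "CUDAExecutionProvider" in active_providers


def require_cuda(active_providers: Sequence[str], require: bool = True) -> bool:
    """Raise (or just report False) when a session is not running on CUDA."""
    if cuda_is_active(active_providers):
        return True
    if require:
        raise RuntimeError(f"session is not using CUDA (active providers: {list(active_providers)})")
    return False


# Usage sketch (requires onnxruntime and a model file; "model.onnx" is a placeholder path):
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
#   require_cuda(session.get_providers())
```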

Pitfall 2: the model's first download runs on every cold start

The model auto-downloads to model_file_dir (default /tmp/audio-separator-models/) on first use. In containers/serverless /tmp is volatile, so it re-downloads hundreds of MB on every cold start—a breeding ground for latency and egress charges.

Countermeasure: point model_file_dir at a persistent volume, and if possible bake the model into the image (download once at build time).
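
A minimal sketch of that countermeasure follows. It assumes the model directory is configured through a `MODEL_CACHE_DIR` environment variable (the name used by the service code later in this article) and that `prefetch_model` is called once from a build or deploy step; both helper names are hypothetical. The audio-separator import is deferred so the path helpers can run without the package installed.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_model_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Read the model directory from MODEL_CACHE_DIR, defaulting to /models."""
    env = os.environ if env is None else env
    return Path(env.get("MODEL_CACHE_DIR", "/models"))


def is_volatile(path: Path) -> bool:
    """True when the path is the system temp directory or lives under it."""
    tmp = Path(tempfile.gettempdir()).resolve()
    resolved = path.resolve()
    return resolved == tmp or tmp in resolved.parents


def prefetch_model(model_filename: str, model_dir: Path) -> None:
    """Download the model once into model_dir (run at image build time, not per request)."""
    from audio_separator.separator import Separator  # deferred: only needed when prefetching

    model_dir.mkdir(parents=True, exist_ok=True)
    separator = Separator(model_file_dir=str(model_dir))
    separator.load_model(model_filename=model_filename)  # triggers the first-time download
```

A startup check such as `if is_volatile(resolve_model_dir()): ...` can then warn or refuse to start before the first cold-start download happens in production.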

Pitfall 3: VRAM exhaustion (OOM) at long inputs / high batch_size

audio-separator processes audio internally in chunks of segment_size, so it is less prone to whole-input OOM than, say, a diffusion model, but raising segment_size or batch_size too far makes it fail with CUDA out of memory.

Countermeasure: treat OOM as an expected path rather than an exceptional one, and retry by recreating the Separator with a smaller segment_size. Reloading the model is expensive, though, so the retry is a safety net; the primary fix is to choose a segment_size that fits your VRAM from the start.

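from audio_separator.separator import Separator


def separate_with_oom_fallback(audio: str, model: str, segment: int = 256):
    # Step down through smaller sizes; dict.fromkeys drops duplicates and keeps order,
    # and the filter avoids "retrying" with a size larger than the one requested.
    sizes = [s for s in dict.fromkeys((segment, segment // 2, 64)) if 0 < s <= segment]
    for seg in sizes:
        try:
            sep = Separator(mdx_params={"segment_size": seg, "overlap": 0.25,
                                        "batch_size": 1, "hop_length": 1024,
                                        "enable_denoise": False})
            sep.load_model(model_filename=model)
            return sep.separate(audio)
        except RuntimeError as exc:              # assumption: the OOM surfaces as a RuntimeError mentioning "memory"
            if "memory" not in str(exc).lower() or seg == sizes[-1]:
                raise                            # not an OOM, or already at the smallest size: a real failure
    raise RuntimeError("unreachable")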

Pitfall 4: input-format mismatch

The input supports common formats like WAV/MP3/FLAC/M4A, but a broken header or an unexpected codec causes processing failure.

Countermeasure: insert one stage of normalization with ffmpeg before feeding. Just unifying the sample rate and channels improves stability.

# Normalize before feeding: unify to 44.1 kHz, stereo, WAV
ffmpeg -y -i input.m4a -ar 44100 -ac 2 normalized.wav
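
In a pipeline, the same normalization is usually called from Python. The sketch below wraps the command above with `subprocess`; it assumes `ffmpeg` is on the PATH, and the helper names `build_normalize_cmd` and `normalize_audio` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_normalize_cmd(src, dst, sample_rate: int = 44100, channels: int = 2) -> list:
    """Build the ffmpeg argument list: resample, set the channel count, overwrite dst."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(sample_rate), "-ac", str(channels), str(dst)]


def normalize_audio(src, dst) -> Path:
    """Run ffmpeg and raise with the tail of stderr if the input cannot be decoded."""
    result = subprocess.run(build_normalize_cmd(src, dst), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"ffmpeg failed for {src}: {result.stderr[-500:]}")
    return Path(dst)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces, and a non-zero exit code rejects broken inputs before they reach the GPU.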

Pitfall 5: checking quality only "by ear"

The most dangerous thing in large batches is that the quality gate is human listening alone. Broken separations slip through.

Countermeasure: after separation, machine-check with a simple metric like vocal residue left in the accompaniment track, and route only those exceeding a threshold to human review or reprocessing. Even if perfect automatic evaluation is hard, even a gate that just "rejects obvious failures" drastically reduces review effort.
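
As an example of a gate that only rejects obvious failures, the sketch below compares loudness levels. It does not measure vocal residue; it assumes the stems have already been decoded to float samples, that the material is known to contain both voice and music, and that the thresholds (`silence_floor`, `max_gain`) are hypothetical starting values to tune on your own data.

```python
import math
from typing import Mapping, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a sample sequence (0.0 for empty input)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def gate_reasons(
    mix: Sequence[float],
    stems: Mapping[str, Sequence[float]],
    silence_floor: float = 1e-4,  # hypothetical threshold
    max_gain: float = 1.5,        # hypothetical threshold
) -> list:
    """Return human-readable reasons why a separation looks broken (empty list = pass)."""
    reasons = []
    mix_level = rms(mix)
    for name, samples in stems.items():
        if any(not math.isfinite(s) for s in samples):
            reasons.append(f"{name}: non-finite samples")
            continue
        level = rms(samples)
        if mix_level > silence_floor and level < silence_floor:
            reasons.append(f"{name}: near-silent")
        elif level > mix_level * max_gain:
            reasons.append(f"{name}: louder than the mix")
    return reasons
```

Jobs that return a non-empty list go to human review or reprocessing; everything else passes through without a listener.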

Implementing the production design principles (idempotency, resilience, observability, cost)

Consolidate the pitfall countermeasures into one production-oriented microservice. This is the difference between "works" and "doesn't fall over in production." The design policy has four points: ① keep the model resident and reuse it, ② an idempotent cache keyed by the input's content, ③ GPU verification and GPU serialization, and ④ don't swallow failures; leave them in structured logs.

# separation_service.py: production-oriented source-separation microservice (UVR5 / MDX-Net)
from __future__ import annotations

import asyncio
import hashlib
import json
import logging
import os
import threading
from pathlib import Path

import onnxruntime as ort
from audio_separator.separator import Separator
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

logger = logging.getLogger("separation")

MODEL_CACHE = Path(os.environ.get("MODEL_CACHE_DIR", "/models"))     # persistent volume recommended
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/data/stems"))
DEFAULT_MODEL = os.environ.get("SEPARATION_MODEL", "UVR-MDX-NET-Inst_HQ_3.onnx")
REQUIRE_GPU = os.environ.get("REQUIRE_GPU", "true").lower() == "true"

# MDX inference parameters. Fixed for reproducibility and included in the idempotency key
MDX_PARAMS = {"segment_size": 256, "overlap": 0.25, "batch_size": 1,
              "hop_length": 1024, "enable_denoise": False}

_separator: Separator | None = None
_gpu_sem = asyncio.Semaphore(int(os.environ.get("GPU_CONCURRENCY", "1")))  # serialize GPU jobs by default
_manifest_lock = threading.Lock()
_MANIFEST = OUTPUT_DIR / "manifest.json"


def _verify_gpu() -> None:
    """Verify that CUDA is available. Falling back to CPU in production is dozens of times slower (Pitfall 1)."""
    providers = ort.get_available_providers()
    if "CUDAExecutionProvider" not in providers:
        msg = f"CUDA not detected (available: {providers}). Check onnxruntime-gpu and CUDA/cuDNN."
        if REQUIRE_GPU:
            raise RuntimeError(msg)
        logger.warning(msg)


def get_separator() -> Separator:
    """Process-wide singleton. Load the model into VRAM once and reuse it across requests."""
    global _separator
    if _separator is None:
        _verify_gpu()
        OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
        sep = Separator(
            output_dir=str(OUTPUT_DIR),
            output_format="flac",            # lossless; no quality loss when remixing
            model_file_dir=str(MODEL_CACHE),  # persist the first-time download (Pitfall 2)
            sample_rate=44100,
            mdx_params=MDX_PARAMS,
        )
        sep.load_model(model_filename=DEFAULT_MODEL)
        _separator = sep
        logger.info("model loaded model=%s", DEFAULT_MODEL)
    return _separator


def _content_key(path: Path) -> str:
    """Key on content + model + parameters rather than the file name, so resends are not recomputed."""
    h = hashlib.sha256()
    with path.open("rb") as f:                      # hash in 1 MiB blocks so large audio never sits in memory
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    h.update(DEFAULT_MODEL.encode())
    h.update(json.dumps(MDX_PARAMS, sort_keys=True).encode())
    return h.hexdigest()[:16]


def _manifest() -> dict[str, list[str]]:
    return json.loads(_MANIFEST.read_text()) if _MANIFEST.exists() else {}


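def _record(key: str, stems: list[str]) -> None:
    with _manifest_lock:        # assumes a single process; with multiple replicas use a shared store such as Redis
        data = _manifest()
        data[key] = stems
        tmp = _MANIFEST.with_suffix(".tmp")
        tmp.write_text(json.dumps(data))
        tmp.replace(_MANIFEST)  # rename into place so readers never see a half-written manifest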


class SeparateRequest(BaseModel):
    audio_path: str = Field(..., description="Input audio path accessible from the server")


class SeparateResponse(BaseModel):
    job_key: str
    stems: list[str]
    cached: bool


app = FastAPI()


@app.on_event("startup")
def _startup() -> None:
    get_separator()             # warm up so the first request does not wait for the model load


@app.post("/separate", response_model=SeparateResponse)
async def separate(req: SeparateRequest) -> SeparateResponse:
    src = Path(req.audio_path)
    if not src.is_file():                                   # validate external input at the boundary
        raise HTTPException(status_code=422, detail="input audio not found")

    key = await asyncio.to_thread(_content_key, src)        # hash off the event loop; large files take a while
    cached = _manifest().get(key)
    if cached and all(Path(p).exists() for p in cached):    # idempotent: resends/retries are not recomputed
        logger.info("cache hit key=%s", key)
        return SeparateResponse(job_key=key, stems=cached, cached=True)

    try:
        async with _gpu_sem:                                # one GPU job at a time: avoids VRAM contention/OOM
            stems = await asyncio.to_thread(get_separator().separate, str(src))
    except Exception as exc:                                # don't swallow failures; log them with the job key
        logger.exception("separation failed key=%s", key)
        raise HTTPException(status_code=500, detail="separation failed") from exc

    _record(key, stems)
    logger.info("separated key=%s stems=%d", key, len(stems))
    return SeparateResponse(job_key=key, stems=stems, cached=False)

The following is a type-safe client for calling this service from TypeScript (Next.js etc.). Always validate at the boundary with Zod.

// lib/separation-client.ts: type-safe client for the Python separation service
import { z } from "zod";

const SeparateResponse = z.object({
  job_key: z.string(),
  stems: z.array(z.string()),
  cached: z.boolean(),
});
export type SeparateResult = z.infer<typeof SeparateResponse>;

export async function separateAudio(audioPath: string): Promise<SeparateResult> {
  const res = await fetch(`${process.env.SEPARATION_SERVICE_URL}/separate`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ audio_path: audioPath }),
  });
  if (!res.ok) throw new Error(`separation failed: ${res.status}`);
  return SeparateResponse.parse(await res.json()); // don't trust external responses; validate at the boundary
}

Restating the design principles in operational terms:

Idempotency: use sha256(audio content + model + parameters) as the job key and cache the result in the manifest. Resends, rapid repeated requests, and retries never trigger a second GPU run. The point is keying by content, not by filename.
Resilience: treat OOM and CPU fallback as design concerns, not exceptions. Reject a missing GPU at startup, and retry OOM with a smaller segment_size.
Observability: log the per-job key, model, stem count, and cache hit/miss in structured logs so you can later trace which job produced what with which model. Don't write the audio content (which can be PII) to logs.
Cost efficiency: ① amortize the load cost with a resident model, ② avoid regeneration with the idempotent cache, ③ skip unneeded stem writes with single_stem, and ④ raise the quality knobs only for the unsatisfactory parts rather than everything. The unit cost comes down through the accumulation of these.
Testability: a small deterministic function like _content_key can be unit-tested with a temporary file and no GPU, and the separation body (GPU-dependent) can be swapped for a mock at the service boundary.
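
The observability point can be sketched as one JSON line per job. The field names (`event`, `job_key`, `model`, `stem_count`, `cached`) and the helper `job_log_record` are hypothetical; the only fixed idea is that the record carries identifiers and counts, never audio content.

```python
import json
import logging

logger = logging.getLogger("separation")


def job_log_record(key: str, model: str, stems: list, cached: bool) -> str:
    """Serialize one job outcome as a single JSON line (no paths, no audio content)."""
    return json.dumps(
        {
            "event": "separation",
            "job_key": key,
            "model": model,
            "stem_count": len(stems),
            "cached": cached,
        },
        sort_keys=True,
    )


def log_job(key: str, model: str, stems: list, cached: bool) -> None:
    """Emit the record through the standard logger so any handler can ship it."""
    logger.info(job_log_record(key, model, stems, cached))
```

Because each line is valid JSON with stable keys, a log aggregator can filter by `job_key` or count cache hits without parsing free-form messages.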

Comparison with other tools/APIs: which to use in the end?

Option	Form	Strengths	Weaknesses	Best suited for
UVR5 (GUI)	Desktop app	Free, easy-to-use GUI, covers MDX-Net, VR, Demucs, and MDXC	No automation, one file at a time	Trials, small volumes, non-engineers
python-audio-separator	Python library/CLI	Runs the same models as UVR from code; can be embedded in batch jobs and servers	Requires environment setup (matching CUDA versions)	Production automation, large-volume processing
Demucs (standalone)	Python library	4-stem separation, made by Meta, active	For two-way vocal/accompaniment separation, MDX is sometimes preferred	4-stem separation, remixing
Commercial cloud API	SaaS	No environment setup, usable immediately	Usage-based billing, data leaves your environment	PoCs, projects where data may leave your environment

The practical decision is roughly this.

Want to check quality first, or only have a small volume → UVR5 GUI.
Want to embed it in a server and process large volumes continuously → python-audio-separator (this article's production design).
Want to split down to drums and bass → Demucs v4.
Want to avoid environment setup entirely, and the data may leave your environment → a commercial API. Confirm demand there first, then move to self-hosting as volume grows. That is the standard route.

The license and distribution terms of each model and tool change over time. Always confirm commercial-use terms in the primary sources. UVR5's code is MIT-licensed at the time of writing (the copyright of the audio you process is a separate matter).

Frequently Asked Questions (FAQ)

Q. What's the relationship between UVR5 and MDX-Net? A. UVR5 (Ultimate Vocal Remover GUI v5.6) is the application, and MDX-Net is one of the AI architectures it ships with. Besides MDX-Net, UVR5 includes VR Architecture, Demucs, and MDX23C/MDXC. For two-way vocal/accompaniment separation, MDX-Net is the standard choice.

Q. How do I use it from code? A. UVR5 itself is a GUI, so you can't call it directly from a script. Install python-audio-separator with pip install "audio-separator[gpu]" and load the same MDX-Net models (.onnx) that UVR uses through a Separator. This article's code section is the shortest path.

Q. Which model should I use? A. Choose by the stem you want. For accompaniment (karaoke), UVR-MDX-NET-Inst_HQ_3; for vocals (a cappella), Kim_Vocal_2; for harmony removal, UVR_MDXNET_KARA_2. Confirm that the model exists with audio-separator --list_models, then compare two or three candidates on your own audio. That is the reliable way to decide.

Q. What GPU / VRAM do I need? A. The official README states that a GTX 1060 6GB is the minimum requirement and recommends 8GB or more of VRAM. When VRAM is tight, lower segment_size. On Apple Silicon (M1 or later), MPS is supported and runs Demucs v4 and all MDX-Net models.
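The "lower segment_size when VRAM is tight" advice can also be automated as a retry. The sketch below is one possible shape under stated assumptions: run is a hypothetical callable that performs one separation at a given segment_size, the sizes (256, 128, 64) are illustrative values rather than recommended settings, and the OOM detection is a message-based heuristic, because the exact exception type and wording depend on the runtime in use.

```python
from typing import Callable, List, Sequence, Tuple


def looks_like_oom(exc: BaseException) -> bool:
    """Heuristic (assumption): treat MemoryError or typical OOM wording as OOM."""
    text = str(exc).lower()
    return (
        isinstance(exc, MemoryError)
        or "out of memory" in text
        or "failed to allocate" in text
    )


def separate_with_fallback(
    run: Callable[[int], List[str]],
    segment_sizes: Sequence[int] = (256, 128, 64),
) -> Tuple[List[str], int]:
    """Try each segment_size in order; step down only when the failure is an OOM."""
    if not segment_sizes:
        raise ValueError("segment_sizes must not be empty")
    last_error = None
    for size in segment_sizes:
        try:
            return run(size), size
        except Exception as exc:
            if not looks_like_oom(exc):
                raise  # unrelated failures must surface unchanged
            last_error = exc
    raise RuntimeError(
        f"still out of memory at segment_size={segment_sizes[-1]}"
    ) from last_error
```

Returning the size that finally succeeded lets the caller log it alongside the job key, which makes it visible when a particular input only fits at a reduced setting.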

Q. I installed the GPU version but it's slow. A. In almost every case it is falling back to CPU because CUDA/cuDNN and onnxruntime-gpu don't match. The supported CUDA versions are 11.8 and 12.2. At startup, verify that onnxruntime.get_available_providers() includes CUDAExecutionProvider (Pitfall 1).
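A startup check along these lines might look like the following sketch. The function name require_cuda_provider and the optional available parameter (added so the check can be tested without a GPU) are hypothetical; the only library call used is onnxruntime.get_available_providers(). Note that this only confirms the installed build exposes the provider. A session may still fall back if the CUDA libraries fail to load, so checking the providers of the actual session is a reasonable additional step.

```python
from typing import Optional, Sequence


def require_cuda_provider(available: Optional[Sequence[str]] = None) -> None:
    """Fail fast at startup instead of silently running on CPU."""
    if available is None:
        # Imported lazily so the check itself can be unit-tested without onnxruntime.
        import onnxruntime

        available = onnxruntime.get_available_providers()
    if "CUDAExecutionProvider" not in available:
        raise RuntimeError(
            "CUDAExecutionProvider not available; providers found: "
            + ", ".join(available)
        )
```

Calling require_cuda_provider() once when the service boots turns a silent multi-fold slowdown into an immediate, explicit startup failure.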

Q. What's the default output format? A. The defaults differ: the Python library defaults to WAV, and the CLI defaults to FLAC. If you remix downstream, explicitly specify a lossless format (flac or wav), because mp3 degrades the audio. By default the output is the two stems, Vocals and Instrumental, at 44.1kHz.

Q. Can I use it for realtime processing? A. It is not suited to that. MDX-Net is batch-oriented and processes one file as a whole. If immediacy is a requirement, consider a different approach.

Q. May I separate a commercial song and publish it? A. Being technically possible and being legally permitted are separate matters. Commercial songs are protected by copyright, and redistributing or publishing the resulting stems without authorization can be infringement. Keep your use to your own songs, licensed material, or private use.

Summary: take UVR5/MDX-Net from "works" to "earns in production"

The essence of MDX-Net is its two-stream structure: it blends a frequency-domain stream and a time-domain stream with a weighted average, which yields more consistently high-quality separation than a single model (arXiv:2111.12203). That is exactly why you need to understand and tune the knobs that stem from this design: segment_size, overlap, and denoise.

The implementation path is simple.

First, check the quality on your own audio with the UVR5 GUI (free, drag & drop).
If the result looks good, move it into code with python-audio-separator, choose the model by purpose, and tune the parameters.
For production, build idempotency, resilience, observability, and cost control into the design: a resident model, GPU verification, GPU serialization, a content-hash idempotent cache, OOM fallback, and structured logs.

Only with all of this in place does it become a product that doesn't fall over on a customer's audio, rather than a demo. And this is the point I most want to convey: this series of design decisions is exactly where outsourced work differentiates itself. Anyone can build a demo that just calls the model, but a foundation that holds up under CPU fallback, OOM, cold-start model re-downloads, and large batches reflects a quality directly proportional to the number of mines you have stepped on.

I run this article's pitfalls and resilience design in production, in an AI video-localization foundation whose first stage is audio separation. If you're considering building or improving an audio/video AI pipeline that includes source separation, take a look at my track record and feel free to reach out. With one person × generative AI, I build end to end, from PoC to production operation: fast, cheap, and safe.

Sources & official resources

UVR5 official repository (GitHub): Anjok07/ultimatevocalremovergui — README (v5.6, MIT, operation requirements, architecture credits)
MDX-Net paper (arXiv): KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing (2111.12203) — two-stream structure, blending, MDX Challenge 2021 results
Lineage paper (arXiv): Investigating U-Nets with various Intermediate Blocks for Spectrogram-based Singing Voice Separation (1912.02591) — the TFC-TDF block (ISMIR 2020)
The library for using it in code: nomadkaraoke/python-audio-separator / PyPI: audio-separator — install, the Separator API, CLI, model list
Discussions on model selection and parameters: Discussion #444 (model recommendations) / Discussion #831 (segment_size etc.)

Note: versions, model names, parameter defaults, and licenses change over time. Always confirm the primary sources before implementing (in particular, confirm that model filenames exist with audio-separator --list_models). The facts in this article (UVR5 v5.6, MIT; 2nd on Leaderboard A / 3rd on B in MDX Challenge 2021; two-stream structure; audio-separator[gpu], Python 3.10+, CUDA 11.8/12.2; default 2 stems at 44.1kHz; output_format default of WAV in the library and FLAC in the CLI, etc.) are based on the official information at the time of writing.