How to choose a source-separation tool: selecting Demucs / UVR5(MDX-Net) / Spleeter / Open-Unmix by requirements

The goal of this article

"I want to split audio into voice and instrumental," "I want to pull just the drums" — when you try to start with music source separation (MSS), the first wall you hit is the problem of too many tools. Demucs, UVR5, MDX-Net, Spleeter, Open-Unmix… all claim to be "cutting-edge" and tout benchmark numbers.

This piece compares them side by side and provides a decision framework for working backward from your requirements to the tool your project should use. Usage of each individual tool is left to its dedicated article (Demucs / UVR5・MDX-Net); this piece concentrates on the selection axes. By the end, you should be able to do these three things:

Explain in one line the strengths/weaknesses of the four major OSS (Demucs / UVR5・MDX-Net / Spleeter / Open-Unmix).
From your own requirements (quality, speed, stem count, budget, operations setup), narrow to one without hesitation.
In a commercial project, preemptively eliminate license pitfalls that could become litigation risk.

About the author (reliability disclosure): I have single-handedly designed, implemented, and run in production an AI video-localization platform that fully automates, just by uploading a video, "audio separation → transcription → translation → multilingual dubbing → lip-sync." Which tool to use in its first stage (audio separation) was an actual decision that determined the quality and cost of everything downstream. The selection axes here aren't a patchwork of catalog specs but the judgment criteria I gained while switching tools in real operations.

30-second summary (requirement → recommended tool)

Your requirement	Recommendation	Reason
4-way split of drums/bass/vocals/other at highest quality	Demucs v4 (`htdemucs` / `htdemucs_ft`)	SOTA-class among public models (about 9.0–9.2 dB SDR)
2-way vocal/instrumental split is the main goal (karaoke, a cappella)	UVR5 (MDX-Net)	Rich vocal/instrumental-specialized models with little residue
Speed and volume above all; decent quality is enough	Spleeter	About 100× real-time on GPU. Ideal for bulk prep
A research baseline / lightweight reference implementation	Open-Unmix (UMX)	A straightforward BiLSTM. High reproducibility and readability
A non-engineer using it by hand	UVR5 GUI	Drag & drop. Win/Mac/Linux supported
Automating on a server / making it an API	Demucs API / python-audio-separator	Callable from code in 3 lines
Absolutely the highest quality (delivery, mastering)	Demucs + MDX ensemble	Combines multiple models' estimates. UVR implements it

"The setup procedure for individual tools" goes to each dedicated article. After this, this piece digs one level deeper into the basis for selection.

The big picture of the four major OSS (one line each)

First, grasp "what each tool is" in the shortest form.

Demucs v4 (Meta): a hybrid model that looks at both waveform and spectrogram and bridges them with a Transformer. The overall No.1 at high-quality separation of 4 stems (+ a 6-stem version with guitar/piano). MIT license. → details
UVR5 (Ultimate Vocal Remover) / MDX-Net: a GUI + model group specialized in 2-way vocal/instrumental separation. MDX-Net is a two-stage configuration combining the frequency and time domains, with a reputation for little residue. Anyone can use it via the GUI, and code automation is also possible with python-audio-separator. MIT license. → details
Spleeter (Deezer): built on TensorFlow. It bundles pre-trained 2/4/5-stem models, and its main strength is overwhelming speed, about 100× real-time on GPU. Quality is a step behind the latest generation, but it is ideal for bulk prep. MIT license.
Open-Unmix (UMX, sigsep): a PyTorch reference implementation. A simple structure that estimates the mask with a 3-layer bidirectional LSTM, widely used as a research baseline. Lightweight and readable, but quality falls short of the more heavily optimized tools above. The code is MIT-licensed; the weights license varies by model (UMXL is non-commercial).

Cross-comparison table

The numbers are "rules of thumb at writing time, based on official information." SDR is a relative value that changes with evaluation conditions, so always make the final judgment by measuring on your own material.

Aspect	Demucs v4	UVR5 / MDX-Net	Spleeter	Open-Unmix
Development	Meta	Anjok07 et al. (OSS)	Deezer	sigsep
Method	Waveform + spectrum + Transformer	Two-stage frequency + time	CNN (spectrogram)	BiLSTM (spectrogram)
Stems	4 / 6	Mainly 2 (Vocals/Inst)	2 / 4 / 5	4
Quality (rule of thumb)	Highest (9.0–9.2 dB)	High (vocal separation especially strong)	Mid (speed-prioritized)	Mid (reference implementation)
Speed	Mid (GPU recommended, CPU possible)	Mid (GPU recommended)	Fastest (100× real-time on GPU)	Fairly fast
Setup	`pip install demucs`	GUI / `pip install audio-separator`	`pip install spleeter`	`pip install openunmix`
GUI	None (CLI/API)	Yes	None	None
Code license	MIT	MIT	MIT	MIT
Weights license	MIT	Confirm per model	MIT	UMXL is non-commercial (CC BY-NC-SA)
Suited use	Multi-stem, overall quality	Vocal/instrumental separation	Bulk, fast prep	Research, baseline

A decision flowchart that selects from requirements

I've arranged the "which one in the end?" question so that you can narrow it down just by answering a few questions.

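Q1. What do you want to separate?
├─ Only two parts: vocals and accompaniment (karaoke, a cappella, dubbing preprocessing)
│    └─ Q2. Will an engineer automate it?
│         ├─ NO (manual work is fine)      → UVR5 GUI
│         └─ YES (built into a server/CI)  → UVR5 models via python-audio-separator
│                                            / or Demucs --two-stems=vocals
│
└─ 4-way split into drums / bass / vocals / other (remixing, transcribing by ear, education)
     └─ Q3. Which do you prioritize, quality or speed?
          ├─ Quality first (deliverables, the main part) → Demucs v4 (htdemucs_ft)
          ├─ Balance                                     → Demucs v4 (htdemucs)
          └─ Speed / volume first                        → prep with Spleeter (4stems)
                                                           → final-process only the adopted tracks with Demucs (two-tier)

* The goal is a research baseline comparison     → Open-Unmix
* You want to squeeze out the last few % quality → Demucs + MDX ensemble (UVR)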
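
For readers who prefer code to diagrams, the flowchart above can be sketched as a small function. This is only an illustration: the `Requirements` class, its field names, and the returned labels are hypothetical names I made up for this example, not part of any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Answers to the flowchart questions (all field names are hypothetical)."""
    stems: int                      # 2 = vocals/accompaniment, 4 = drums/bass/vocals/other
    automated: bool = False         # Q2: will an engineer script it (server / CI)?
    priority: str = "balance"       # Q3: "quality", "balance", or "speed"
    research_baseline: bool = False  # footnote: baseline comparison is the goal


def recommend(req: Requirements) -> str:
    """Return the flowchart's recommendation as a short label."""
    if req.research_baseline:
        return "Open-Unmix"
    if req.stems == 2:
        if req.automated:
            return "UVR5 models via python-audio-separator (or Demucs --two-stems=vocals)"
        return "UVR5 GUI"
    if req.stems == 4:
        if req.priority == "quality":
            return "Demucs v4 (htdemucs_ft)"
        if req.priority == "balance":
            return "Demucs v4 (htdemucs)"
        if req.priority == "speed":
            return "Spleeter (4stems) for prep, then Demucs on the adopted tracks"
        raise ValueError(f"unknown priority: {req.priority!r}")
    raise ValueError(f"unsupported stem count: {req.stems}")


if __name__ == "__main__":
    print(recommend(Requirements(stems=2)))
    print(recommend(Requirements(stems=4, priority="quality")))
```

Writing the branches down like this also makes it obvious which requirements the flowchart does not cover (for example 5 or 6 stems), which then fall back to reading the comparison table.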

There are two key points in the decision.

The path branches first on "2-way" vs. "multi-way." If you only need voice and instrumental, as in karaoke or dubbing preprocessing, a multi-stem model is overkill. The vocal-specialized UVR5/MDX-Net or Demucs's --two-stems=vocals is enough.
Quality and speed are a trade-off. Running every song at the highest quality is wasteful. The two-tier setup of "prep everything with Spleeter → final-process only the adopted tracks with Demucs" is a standard way to balance quality and cost.
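
The two-tier setup can be sketched as a plan of CLI invocations. This is a minimal sketch under assumptions: it assumes the `spleeter` and `demucs` command-line tools are installed and on PATH, the helper names and the `prep` / `final` directory layout are hypothetical, and the "adopted" set stands in for whatever review step you run between the two tiers. By default it only prints the commands.

```python
import subprocess
from pathlib import Path


def spleeter_prep_cmd(track: Path, out_dir: Path) -> list[str]:
    """Tier 1: fast pass with Spleeter's bundled 4-stem model."""
    return ["spleeter", "separate", "-p", "spleeter:4stems", "-o", str(out_dir), str(track)]


def demucs_final_cmd(track: Path, out_dir: Path, two_stems: bool = False) -> list[str]:
    """Tier 2: high-quality pass with the fine-tuned Demucs v4 model."""
    cmd = ["demucs", "-n", "htdemucs_ft", "-o", str(out_dir)]
    if two_stems:
        cmd.append("--two-stems=vocals")  # vocals + accompaniment only
    cmd.append(str(track))
    return cmd


def plan_two_tier(tracks: list[Path], adopted: set[Path], out_root: Path) -> list[list[str]]:
    """Spleeter for every track, Demucs only for the adopted subset."""
    plan = [spleeter_prep_cmd(t, out_root / "prep") for t in tracks]
    plan += [demucs_final_cmd(t, out_root / "final") for t in tracks if t in adopted]
    return plan


def execute(plan: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or actually run them when dry_run is False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    tracks = [Path("a.mp3"), Path("b.mp3")]
    execute(plan_two_tier(tracks, adopted={Path("b.mp3")}, out_root=Path("out")))
```

Keeping the plan as plain argument lists makes it easy to log, review, or hand to a job queue before anything touches a GPU.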

Always effective in commercial projects: license pitfalls

This is where B2B projects differ the most. If you crudely assume "it's OSS, so commercial use is fine," it will trip you up later. There are three layers to confirm.

Layer 1: the code license

Demucs, UVR5, Spleeter, and Open-Unmix all ship their code under the MIT license, which permits commercial use. This layer is mostly safe.

Layer 2: the weights (trained model) license ← easy to overlook

Even if the code is MIT, the distributed trained model carries its own license. The most typical trap is Open-Unmix's UMXL: its high-quality weights are under a non-commercial license (CC BY-NC-SA 4.0), so embedding them in a commercial product is a violation (for commercial use, choose the plain UMX / UMXHQ).

🔑 The senior's iron rule: not limited to source separation, when introducing an AI model commercially, always confirm "the code license" and "the weights license" separately. Read not only the README license field but the license on the model distribution page (HuggingFace / each model card). Whether you've made this a habit changes a project's safety.

Layer 3: the rights of the "separated audio itself" ← most important

Even if the tool permits commercial use, the copyright and master rights of the input song are a separate matter. Separating a commercial song and distributing or selling the result as a karaoke track or a cappella requires the rights holder's permission. Freedom of the tool ≠ freedom of the material. Treat this as a rights-clearance issue, not a technical one, and settle it under the responsibility of the user (and the client).

Layer to confirm	Demucs	UVR5	Spleeter	Open-Unmix
Code	MIT ✅	MIT ✅	MIT ✅	MIT ✅
Weights	MIT ✅	Confirm per model ⚠️	MIT ✅	UMXL is non-commercial ❌
Separated audio	Always needs separate rights processing ⚠️	Same as left	Same as left	Same as left
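
One way to make the three-layer check a habit is to encode it as a gate that runs before a model goes into a commercial build. The sketch below is an assumption-laden illustration: the model identifiers, the `ModelLicense` class, and the allow-list are hypothetical, and the license values are simply transcribed from the table above, so re-verify them against the primary sources before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLicense:
    code: str      # layer 1: license of the source code
    weights: str   # layer 2: license of the trained weights ("unverified" = go read the model card)


# Transcribed from the table above; the keys are hypothetical identifiers.
LICENSES = {
    "demucs/htdemucs": ModelLicense(code="MIT", weights="MIT"),
    "spleeter/4stems": ModelLicense(code="MIT", weights="MIT"),
    "open-unmix/umxl": ModelLicense(code="MIT", weights="CC BY-NC-SA 4.0"),
    "uvr5/some-mdx-model": ModelLicense(code="MIT", weights="unverified"),
}

# Allow-list of licenses accepted for commercial use (an assumption; extend after legal review).
COMMERCIAL_OK = {"MIT"}


def commercial_blockers(model_id: str, input_rights_cleared: bool) -> list[str]:
    """Return the reasons this model/material combination is not ready for commercial use."""
    lic = LICENSES[model_id]
    blockers = []
    if lic.code not in COMMERCIAL_OK:
        blockers.append(f"code license not allowed: {lic.code}")
    if lic.weights not in COMMERCIAL_OK:
        blockers.append(f"weights license not allowed: {lic.weights}")
    if not input_rights_cleared:
        blockers.append("rights to the input audio are not cleared")  # layer 3
    return blockers


if __name__ == "__main__":
    for model_id in LICENSES:
        print(model_id, commercial_blockers(model_id, input_rights_cleared=True))
```

An allow-list (rather than a deny-list) means an unknown or unverified weights license blocks the build by default, which matches the "confirm per model" entries in the table.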

Don't swallow the benchmark numbers: the SDR trap

When you see an SDR comparison like "Demucs is 9.2 dB, Spleeter is…," take a step back. SDR (Signal-to-Distortion Ratio) is a relative value that depends strongly on the evaluation dataset, the model version, and the evaluation method (for example, how per-frame scores are aggregated). It's not uncommon for paper A's 9.2 dB and article B's number to have been measured under different conditions in the first place.

The correct attitude in practice is this.

Use official benchmarks as "a rule of thumb for ordering" (the ordering Demucs > Spleeter can be trusted).
Always make the final judgment by measuring on "your own material." J-POP, narration, podcasts, live recordings: strengths and weaknesses change with genre and mic environment.
Measure with numbers, not by ear alone. Compute SDR/ISR/SIR/SAR with museval (BSSEval v4) and line up the candidate tools on the same material and the same metric. The concrete procedure is covered in the article on measuring source-separation quality with numbers.

When I choose a tool in a project, I take a two-stage process: "narrow to 2–3 with official benchmarks → compare with museval on 10 of the customer's real materials → confirm by ear too." Deciding by catalog numbers alone leads to the "more residue than expected" kind of accident on production material.
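
To make "same material, same metric" concrete, here is a minimal sketch of the plain SDR definition, 10·log10(‖s‖² / ‖s − ŝ‖²), used to rank candidates against one reference. It is deliberately simplified: it is not museval's BSSEval v4 computation, so its numbers will not match museval's output, and the function and candidate names are hypothetical. Use it to understand the metric, and museval for the real comparison.

```python
import math


def sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Plain signal-to-distortion ratio in dB between a reference stem and an estimate."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def rank_candidates(reference: list[float], estimates: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every candidate on the same reference and sort best-first."""
    scores = {name: sdr_db(reference, est) for name, est in estimates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]           # toy "ground-truth" vocal stem
    candidates = {
        "tool_a": [0.9, -0.9, 0.9, -0.9],        # hypothetical tool outputs
        "tool_b": [0.5, -0.5, 0.5, -0.5],
    }
    for name, score in rank_candidates(reference, candidates):
        print(f"{name}: {score:.2f} dB")
```

The point of the ranking helper is that every candidate is scored against the identical reference; comparing numbers taken from different materials reproduces exactly the trap described above.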

Build vs. buy: OSS self-host or commercial API

"Whether to run it yourself in the first place, or use a SaaS API" is also part of selection.

Aspect	OSS self-host (Demucs, etc.)	Commercial API / SaaS
Initial cost	Environment setup, GPU procurement	Nearly zero (same day)
Unit price	Cheap at high volume (fill the GPU)	Usage-based piles up
Data sovereignty	Completes in-house (sensitive audio is safe)	External transmission required
Customization	Models and parameters freely	Depends on the provided range
Operations load	You run and monitor it (where this article series comes in)	Can be left to the provider

Rules of thumb for the judgment:

Small / prototype / low data sensitivity → first confirm quality with a commercial API or a local CLI.
Steadily high volume / high data sensitivity / want to lower the unit price → self-host OSS. Run it with GPU workers. The production design for that is detailed in the article on making source separation a production API.
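
The "cheap at high volume" vs. "usage-based piles up" trade-off can be turned into a break-even estimate. This is a rough sketch: every name and every number below is a placeholder I chose for illustration, not a real price, and `realtime_factor` (minutes of audio processed per minute of GPU time) is something you would measure on your own hardware.

```python
from typing import Optional


def api_cost(audio_minutes: float, price_per_minute: float) -> float:
    """Monthly cost of a usage-based API."""
    return audio_minutes * price_per_minute


def self_host_cost(audio_minutes: float, fixed_monthly: float,
                   gpu_cost_per_hour: float, realtime_factor: float) -> float:
    """Monthly cost of self-hosting: fixed operations cost plus GPU time actually used."""
    gpu_hours = audio_minutes / 60.0 / realtime_factor
    return fixed_monthly + gpu_hours * gpu_cost_per_hour


def break_even_minutes(price_per_minute: float, fixed_monthly: float,
                       gpu_cost_per_hour: float, realtime_factor: float) -> Optional[float]:
    """Monthly audio volume above which self-hosting becomes cheaper (None if it never does)."""
    self_host_per_minute = gpu_cost_per_hour / 60.0 / realtime_factor
    margin = price_per_minute - self_host_per_minute
    if margin <= 0:
        return None
    return fixed_monthly / margin


if __name__ == "__main__":
    # Placeholder values only; substitute your own quotes and measurements.
    minutes = break_even_minutes(price_per_minute=0.10, fixed_monthly=300.0,
                                 gpu_cost_per_hour=1.2, realtime_factor=10.0)
    print(f"break-even at about {minutes:.0f} audio minutes per month")
```

Below the break-even volume the fixed cost dominates, which is the arithmetic behind confirming demand first and self-hosting later.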

My stance, as someone developing solo with generative AI, is "first confirm quality and demand with OSS locally, and once the volume is visible, move it onto a self-hosted GPU worker." This is the realistic approach that avoids paying wasteful fixed costs and avoids lock-in.

Frequently asked questions (FAQ)

Q. In the end, what should I install first? A. Demucs for multi-stem separation, UVR5 for just vocal/instrumental. Having these two covers most projects. If in doubt, start with Demucs.

Q. Which is better, Demucs or UVR5(MDX-Net)? A. It depends on the purpose. If you want all-round strength including drums/bass/other, choose Demucs; for vocal/instrumental separation with little residue, the MDX-Net family is stronger in many situations. To aim for the highest quality, ensemble both (UVR implements it).
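
As a rough picture of what "ensemble both" means, the sketch below averages several estimates of the same stem sample by sample. This is an assumption for illustration only: UVR's own ensemble options may combine estimates differently, and the function name and toy data are hypothetical.

```python
from typing import Optional, Sequence


def average_ensemble(estimates: Sequence[Sequence[float]],
                     weights: Optional[Sequence[float]] = None) -> list[float]:
    """Weighted sample-wise mean of several estimates of the same stem."""
    if not estimates:
        raise ValueError("need at least one estimate")
    length = len(estimates[0])
    if any(len(est) != length for est in estimates):
        raise ValueError("all estimates must have the same length")
    if weights is None:
        weights = [1.0] * len(estimates)
    if len(weights) != len(estimates):
        raise ValueError("one weight per estimate is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # For each sample index, combine the value predicted by every model.
    return [
        sum(w * est[i] for w, est in zip(weights, estimates)) / total
        for i in range(length)
    ]


if __name__ == "__main__":
    vocals_model_a = [1.0, 0.0, 0.5]   # toy outputs standing in for two models
    vocals_model_b = [0.0, 1.0, 0.5]
    print(average_ensemble([vocals_model_a, vocals_model_b]))
```

Averaging only helps when the models make different mistakes, so it is worth checking with the same measurement procedure whether the ensemble actually beats the best single model on your material.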

Q. Is Spleeter already outdated? A. Its quality is behind the latest generation, but its speed is still top-class. It remains useful as the first stage of "prep everything fast → final-process only the adopted tracks with a high-quality model."

Q. Is it OK to embed in a commercial product? A. The code is all MIT, so that layer is mostly fine. But always confirm the weights license (in particular, Open-Unmix UMXL is non-commercial) and the rights clearance of the input song separately (see the license section).

Q. I only have a CPU. A. It works (slowly). Demucs takes about 1.5× the track's duration to process on CPU (slower than real time); Spleeter is relatively light. A GPU is recommended for bulk processing, but a CPU is enough for small-scale verification.

Q. Should I just choose the one with the highest SDR on the benchmark? A. That's dangerous. SDR is a relative value that depends on evaluation conditions. The iron rule is to measure on your own material with museval and then choose (see the SDR section).

Conclusion: tool selection goes "requirement → constraint → measurement"

There's no "one all-purpose" source-separation OSS. That's exactly why selection should be done with a framework, not by feel.

Requirement: what you want to split (2-way or multi-way), whether you prioritize quality or speed.
Constraint: license (the three layers of code, weights, material), budget, operations setup, data sensitivity.
Measurement: narrow candidates to 2–3 and decide after comparing on your own material with museval.

Nail things down in this order and you avoid "install the famous one for now and regret it later." This selection step is also exactly where outsourcing produces value. Anyone can run a tool, but reading the requirements and constraints, choosing the optimal tool (or combination), and eliminating license risk as well is where experience turns directly into quality.

I chose the tools for the AI video-localization platform I run in production using the selection axes described here. If you're considering technology selection, a PoC, or productionization of voice/video AI including source separation, take a look at my track record and feel free to consult me. Working solo with generative AI, I support you from decision-making through implementation: fast, cheap, and safe.

Sources / official resources

Demucs: adefossez/demucs — model list, SDR, license (explanation article)
UVR5 / MDX-Net: Anjok07/ultimatevocalremovergui / MDX-Net paper arXiv:2111.12203 (explanation article)
Spleeter: deezer/spleeter — TensorFlow, 2/4/5 stems, MIT
Open-Unmix: sigsep/open-unmix-pytorch — UMX/UMXHQ/UMXL (UMXL is a non-commercial license)
Evaluation tool: sigsep/sigsep-mus-eval (museval) — BSSEval v4 (SDR/ISR/SIR/SAR)

Licenses, quality, and pricing get updated. For commercial use, always confirm primary sources (especially the weights license). The numbers here are rules of thumb based on official information at writing time.
