友田 陽大
Audio source separation & preprocessing
Tags: source separation, Demucs, UVR5, technology selection, audio processing, Python, generative AI

How to choose a source-separation tool: selecting Demucs / UVR5(MDX-Net) / Spleeter / Open-Unmix by requirements

A side-by-side comparison of the major open-source music source separation tools (Demucs v4, UVR5 (MDX-Net), Spleeter, and Open-Unmix) across quality, speed, license, setup difficulty, and memory. It presents, with real code, a decision framework for working backward from your requirements to the right tool for a given project, along with the license pitfalls you must check before any commercial use.


The goal of this article

"I want to split audio into voice and instrumental." "I want to pull out just the drums." When you start working with music source separation (MSS), the first wall you hit is the sheer number of tools. Demucs, UVR5, MDX-Net, Spleeter, Open-Unmix: all claim to be "cutting-edge" and tout benchmark numbers.

This article provides a decision framework that compares the tools side by side and lets you work backward from your requirements to the one you should choose for your project. How to use each tool is left to its dedicated article (Demucs / UVR5 and MDX-Net); this piece concentrates on the selection criteria. By the time you finish reading, you should be able to do three things:

  1. Explain in one line the strengths and weaknesses of the four major OSS tools (Demucs / UVR5 and MDX-Net / Spleeter / Open-Unmix).
  2. Narrow the choice to one tool, without hesitation, from your own requirements (quality, speed, stem count, budget, operations setup).
  3. In a commercial project, eliminate in advance the license pitfalls that could become a litigation risk.

About the author (credibility disclosure): I single-handedly designed, implemented, and run in production an AI video-localization platform that fully automates the pipeline "audio separation → transcription → translation → multilingual dubbing → lip-sync" from a single video upload. The choice of tool for its first stage (audio separation) was a real decision that determined the quality and cost of everything downstream. The selection criteria here are not a patchwork of catalog specs; they are the judgment criteria I developed while switching tools in real operations.


| Your requirement | Recommendation | Reason |
|---|---|---|
| 4-way split of drums/bass/vocals/other at the highest quality | Demucs v4 (htdemucs / htdemucs_ft) | SOTA-class among public models (9.0–9.2 dB SDR) |
| 2-way vocal/instrumental split is the main goal (karaoke, a cappella) | UVR5 (MDX-Net) | Many vocal/instrumental-specialized models with little residue |
| Speed and volume above all; decent quality is enough | Spleeter | 100× real-time on GPU. Ideal for preprocessing prep |
| A research baseline / lightweight reference implementation | Open-Unmix (UMX) | A straightforward BiLSTM. High reproducibility and readability |
| A non-engineer using it by hand | UVR5 GUI | Drag and drop. Windows/Mac/Linux supported |
| Automating on a server / exposing it as an API | Demucs API / python-audio-separator | Callable from code in 3 lines |
| Absolutely the highest quality (delivery, mastering) | Demucs + MDX ensemble | Combines multiple models' estimates. UVR implements it |

Setup procedures for the individual tools are covered in each dedicated article. The rest of this piece digs one level deeper into the reasoning behind the selection.


The big picture of the four major OSS (one line each)

First, grasp "what each tool is" in the shortest form.

  • Demucs v4 (Meta): a hybrid model that processes both the waveform and the spectrogram and bridges the two with a Transformer. The best all-rounder for high-quality 4-stem separation (a 6-stem version adds guitar and piano). MIT license.
  • UVR5 (Ultimate Vocal Remover) / MDX-Net: a GUI plus a collection of models specialized in 2-way vocal/instrumental separation. MDX-Net is a two-stage configuration combining the frequency and time domains, with a reputation for leaving little residue. Anyone can use it through the GUI, and it can also be automated from code with python-audio-separator. MIT license.
  • Spleeter (Deezer): built on TensorFlow. It bundles pre-trained 2/4/5-stem models, and its main strength is speed: about 100× real-time on a GPU. Its quality is a step behind the latest generation, but it is ideal for bulk prep work. MIT license.
  • Open-Unmix (UMX, sigsep): a PyTorch reference implementation. It has a simple structure that estimates the mask with a 3-layer bidirectional LSTM and is widely used as a research baseline. Lightweight and readable, but its quality falls behind the more heavily optimized tools above.

Cross-comparison table

The numbers are "rules of thumb at writing time, based on official information." SDR is a relative value that changes with evaluation conditions, so always make the final judgment by measuring on your own material.

| Aspect | Demucs v4 | UVR5 / MDX-Net | Spleeter | Open-Unmix |
|---|---|---|---|---|
| Development | Meta | Anjok07 et al. (OSS) | Deezer | sigsep |
| Method | Waveform + spectrum + Transformer | Two-stage frequency + time | CNN (spectrogram) | BiLSTM (spectrogram) |
| Stems | 4 / 6 | Mainly 2 (vocals/instrumental) | 2 / 4 / 5 | 4 |
| Quality (rule of thumb) | Highest (9.0–9.2 dB) | High (vocal separation especially strong) | Mid (speed-prioritized) | Mid (reference implementation) |
| Speed | Mid (GPU recommended, CPU possible) | Mid (GPU recommended) | Fastest (100× real-time on GPU) | Fairly fast |
| Setup | pip install demucs | GUI / pip install audio-separator | pip install spleeter | pip install openunmix |
| GUI | None (CLI/API) | Yes | None | None |
| Code license | MIT | MIT | MIT | MIT |
| Weights license | MIT | Confirm per model | MIT | UMXL is non-commercial (CC BY-NC-SA) |
| Suited use | Multi-stem, overall quality | Vocal/instrumental separation | Bulk, fast prep | Research, baseline |

A decision flowchart that selects from requirements

I've arranged the "which one in the end" question so that you can narrow it down just by answering a few questions.

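Q1. What do you want to split?
├─ Just two parts: vocals and accompaniment (karaoke, a cappella, dubbing preprocessing)
│    └─ Q2. Will an engineer automate it?
│         ├─ NO (manual work is fine)        → UVR5 GUI
│         └─ YES (built into a server/CI)    → UVR5-family models via python-audio-separator
│                                              / or Demucs --two-stems=vocals
│
└─ Four parts: drums / bass / vocals / other (remixing, transcribing by ear, education)
     └─ Q3. Which takes priority, quality or speed?
          ├─ Quality first (deliverables, lead material) → Demucs v4 (htdemucs_ft)
          ├─ Balanced                                    → Demucs v4 (htdemucs)
          └─ Speed and volume first                      → prep with Spleeter (4stems)
                                                          → final pass with Demucs on the adopted tracks only (two-tier)

Note: if the goal is a research baseline comparison → Open-Unmix
Note: if you want to squeeze out the last few percent of quality → Demucs + MDX ensemble (UVR)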
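
The flowchart above can also be written down as a small function, which is handy if the choice has to be made per job inside a pipeline. This is only a sketch of the branches as drawn: the `Requirements` class, its field names, and the returned strings are illustrative assumptions, not part of any of the tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Answers to the flowchart questions (hypothetical field names)."""
    stems: int                 # 2 = vocals/accompaniment, 4 = drums/bass/vocals/other
    automated: bool = True     # Q2: will an engineer automate it?
    priority: str = "balance"  # Q3: "quality", "balance", or "speed"


def recommend(req: Requirements) -> str:
    """Return the flowchart's recommendation for the given answers."""
    if req.stems == 2:
        if not req.automated:
            return "UVR5 GUI"
        return "UVR5 models via python-audio-separator, or Demucs --two-stems=vocals"
    if req.stems == 4:
        if req.priority == "quality":
            return "Demucs v4 (htdemucs_ft)"
        if req.priority == "balance":
            return "Demucs v4 (htdemucs)"
        if req.priority == "speed":
            return "Spleeter (4stems) for prep, then Demucs on the adopted tracks"
        raise ValueError(f"unknown priority: {req.priority!r}")
    raise ValueError("this sketch covers only the 2-way and 4-way branches")


if __name__ == "__main__":
    print(recommend(Requirements(stems=2, automated=False)))
    print(recommend(Requirements(stems=4, priority="quality")))
```

The two side notes (research baseline, ensemble) are left out on purpose, since they override the tree rather than extend it.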

Two points drive the decision.

  • The path splits mainly on 2-way versus multi-way. If you want only voice and instrumental, as in karaoke or dubbing preprocessing, a multi-stem model is overkill. The vocal-specialized UVR5/MDX-Net, or Demucs with --two-stems=vocals, is enough.
  • Quality and speed are a trade-off. Running every song at the highest quality is wasteful. The two-tier setup, where you prep everything with Spleeter and then run the final pass with Demucs only on the tracks you adopt, is the standard way to balance quality and cost.
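
As a rough illustration of the two-tier setup, the sketch below only builds the command lines for both passes and prints them (a dry run). The function names and directory names are hypothetical. The flags (`spleeter separate -p spleeter:4stems -o`, and `demucs -n`, `-o`, `--two-stems=vocals`) follow each tool's README as I understand it, so check them against the versions you actually install before running anything.

```python
def spleeter_prep_cmd(track: str, out_dir: str) -> list[str]:
    """Fast first pass with Spleeter's pre-trained 4-stem model."""
    return ["spleeter", "separate", "-p", "spleeter:4stems", "-o", out_dir, track]


def demucs_final_cmd(track: str, out_dir: str, two_stems: bool = False) -> list[str]:
    """High-quality final pass with the fine-tuned Demucs v4 model."""
    cmd = ["demucs", "-n", "htdemucs_ft", "-o", out_dir]
    if two_stems:
        cmd.append("--two-stems=vocals")  # vocals + accompaniment only
    cmd.append(track)
    return cmd


def plan_two_tier(tracks: list[str], adopted: set[str],
                  prep_dir: str = "prep", final_dir: str = "final") -> list[list[str]]:
    """Prep every track with Spleeter, then queue Demucs only for adopted tracks."""
    commands = [spleeter_prep_cmd(t, prep_dir) for t in tracks]
    commands += [demucs_final_cmd(t, final_dir) for t in tracks if t in adopted]
    return commands


if __name__ == "__main__":
    # Dry run: print the plan. To execute, pass each list to subprocess.run(cmd, check=True).
    for cmd in plan_two_tier(["a.wav", "b.wav", "c.wav"], adopted={"b.wav"}):
        print(" ".join(cmd))
```

In a real pipeline the `adopted` set would come from whatever review step follows the prep pass; here it is simply passed in.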

Always relevant in commercial projects: license pitfalls

This is where B2B projects differ the most. If you casually assume "it's OSS, so commercial use is fine," it will trip you up later. There are three layers to check.

Layer 1: the code license

Demucs, UVR5, Spleeter, and Open-Unmix all have MIT-licensed code, which permits commercial use. This layer is mostly safe.

Layer 2: the weights (trained model) license ← easy to overlook

Even if the code is MIT, the license of the distributed trained model is a separate matter. The most typical trap is Open-Unmix's UMXL: these high-quality weights are under a non-commercial license (CC BY-NC-SA 4.0), so embedding them in a commercial product is a violation (for commercial use, choose the plain UMX / UMXHQ).

🔑 A veteran's iron rule: this is not limited to source separation. Whenever you introduce an AI model commercially, check the code license and the weights license separately. Read not only the license field in the README but also the license on the model distribution page (HuggingFace or the individual model card). Making this a habit changes how safe a project is.

Layer 3: the rights of the "separated audio itself" ← most important

Even if the tool allows commercial use, the copyright and master rights of the input song are a separate matter. Separating a commercial song and distributing or selling the result as a karaoke track or a cappella requires the rights holder's permission. Freedom of the tool ≠ freedom of the material. Treat this as a rights-clearance issue, not a technical one, and settle it under the responsibility of the user (and the client).

| Layer to confirm | Demucs | UVR5 | Spleeter | Open-Unmix |
|---|---|---|---|---|
| Code | MIT ✅ | MIT ✅ | MIT ✅ | MIT ✅ |
| Weights | MIT ✅ | Confirm per model ⚠️ | MIT ✅ | UMXL is non-commercial ❌ |
| Separated audio | Always needs separate rights processing ⚠️ | Same as left | Same as left | Same as left |
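
The three-layer check can be sketched as a small gate that lists every open issue before a tool goes into a commercial product. The class and function names are hypothetical, and the statuses are simply transcribed from the table above, so re-check the primary sources rather than trusting this dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "commercial use permitted"
    CONFIRM = "confirm per model"
    NON_COMMERCIAL = "non-commercial only"


@dataclass(frozen=True)
class ToolLicense:
    code: Status
    weights: Status


# Transcribed from the table above; licenses change, so verify before relying on this.
LICENSES = {
    "Demucs": ToolLicense(Status.OK, Status.OK),
    "UVR5": ToolLicense(Status.OK, Status.CONFIRM),
    "Spleeter": ToolLicense(Status.OK, Status.OK),
    "Open-Unmix (UMXL)": ToolLicense(Status.OK, Status.NON_COMMERCIAL),
}


def commercial_blockers(tool: str, input_rights_cleared: bool) -> list[str]:
    """List open issues across the three layers; an empty list means none were found."""
    lic = LICENSES[tool]
    issues = []
    if lic.code is not Status.OK:
        issues.append(f"code: {lic.code.value}")
    if lic.weights is not Status.OK:
        issues.append(f"weights: {lic.weights.value}")
    if not input_rights_cleared:
        issues.append("separated audio: rights for the input material are not cleared")
    return issues


if __name__ == "__main__":
    for name in LICENSES:
        print(name, commercial_blockers(name, input_rights_cleared=False))
```

Note that the third layer is an input to the function, not a property of the tool: no choice of tool can clear it for you.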

Don't swallow the benchmark numbers: the SDR trap

When you see an SDR comparison like "Demucs is 9.2 dB, Spleeter is…," take a step back. SDR (Signal-to-Distortion Ratio) is a relative value that strongly depends on the evaluation dataset, the model's version, and the evaluation method (how frame aggregation is taken). It's not uncommon that paper A's 9.2 dB and article B's number weren't measured on the same footing in the first place.

In practice, the right approach is as follows.

  • Use official benchmarks as a rule of thumb for ranking (you can trust the ordering Demucs > Spleeter).
  • Always make the final judgment by measuring on your own material. J-POP, narration, podcasts, live recordings: strengths and weaknesses change with genre and mic environment.
  • Measure with numbers, not by ear alone. Compute SDR/SIR/SAR with museval (BSSEval v4) and line up the candidate tools on the same material and the same metric. The concrete procedure is collected in the article on measuring source-separation quality with numbers.

When I choose a tool for a project, I use a staged process: narrow to 2–3 candidates with official benchmarks, compare them with museval on 10 of the customer's real materials, then confirm by ear as well. Deciding by catalog numbers alone leads to surprises like "more residue than expected" on production material.
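
To make the dependence on frame aggregation concrete, here is a toy calculation using only the standard library. It uses the plain definition SDR = 10·log10(‖s‖² / ‖s − ŝ‖²), which is an assumption for illustration and simpler than what museval computes under BSSEval v4. The signals are synthetic: the estimate is nearly perfect for three quarters of the clip and poor for the last quarter.

```python
import math
import statistics


def sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Plain SDR in dB: 10*log10(signal energy / error energy)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0 or error == 0:
        return float("nan")  # undefined for silent or perfectly matched input
    return 10 * math.log10(signal / error)


def framewise_median_sdr(reference: list[float], estimate: list[float],
                         frame_len: int) -> float:
    """Median of per-frame SDR over non-overlapping frames, skipping undefined frames."""
    scores = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        score = sdr_db(reference[start:start + frame_len],
                       estimate[start:start + frame_len])
        if not math.isnan(score):
            scores.append(score)
    return statistics.median(scores)


if __name__ == "__main__":
    reference = [1.0, -1.0] * 200                     # 400 synthetic samples
    estimate = ([0.99 * r for r in reference[:300]]   # small error in the first 3 frames
                + [0.5 * r for r in reference[300:]]) # large error in the last frame
    print(f"whole-clip SDR:       {sdr_db(reference, estimate):.2f} dB")
    print(f"median framewise SDR: {framewise_median_sdr(reference, estimate, 100):.2f} dB")
```

On this toy input the whole-clip figure is about 12 dB while the framewise median is 40 dB, for the same pair of signals. That gap is why two published numbers are only comparable when the aggregation method matches.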


Build vs. buy: OSS self-host or commercial API

Whether to run it yourself in the first place or to use a SaaS API is also part of the selection.

| Aspect | OSS self-host (Demucs, etc.) | Commercial API / SaaS |
|---|---|---|
| Initial cost | Environment setup, GPU procurement | Nearly zero (same day) |
| Unit price | Cheap at high volume (keep the GPU full) | Usage-based charges pile up |
| Data sovereignty | Stays in-house (sensitive audio is safe) | External transmission required |
| Customization | Models and parameters are freely adjustable | Limited to what the provider offers |
| Operations load | You monitor it (where this article series comes in) | Can be left to the provider |

Rules of thumb for the judgment:

  • Small / prototype / low data sensitivity → first confirm quality with a commercial API or a local CLI.
  • Steady high volume / high data sensitivity / need to lower the unit price → self-host OSS and run it on GPU workers. The production design for that is detailed in the article on making source separation a production API.
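
The "once the volume is visible" judgment comes down to a break-even calculation between a usage-based API and a fixed-cost GPU worker. The sketch below shows the arithmetic; every number in it (price per minute, monthly GPU cost, variable cost) is a made-up placeholder, not a quote from any provider.

```python
import math


def monthly_cost_api(minutes: float, price_per_minute: float) -> float:
    """Usage-based cost: grows linearly with processed audio minutes."""
    return minutes * price_per_minute


def monthly_cost_self_host(minutes: float, fixed_cost: float,
                           variable_cost_per_minute: float = 0.0) -> float:
    """Fixed GPU/ops cost plus a small per-minute cost (power, storage, etc.)."""
    return fixed_cost + minutes * variable_cost_per_minute


def break_even_minutes(price_per_minute: float, fixed_cost: float,
                       variable_cost_per_minute: float = 0.0) -> float:
    """Monthly volume above which self-hosting is cheaper than the API."""
    margin = price_per_minute - variable_cost_per_minute
    if margin <= 0:
        return math.inf  # the API is never more expensive per minute
    return fixed_cost / margin


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    api_price, gpu_fixed, gpu_variable = 0.05, 300.0, 0.01
    volume = break_even_minutes(api_price, gpu_fixed, gpu_variable)
    print(f"break-even at about {volume:,.0f} audio minutes per month")
```

With these placeholder figures the break-even lands at about 7,500 audio minutes per month. Below that volume the API is cheaper, which is the case for starting small. The operations load of self-hosting is not priced in here and pushes the real break-even higher.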

My stance, as a solo developer working with generative AI, is this: first confirm quality and demand with OSS locally, and once the volume is clear, move it onto a self-hosted GPU worker. This is the realistic approach that avoids wasted fixed costs and lock-in.


Frequently asked questions (FAQ)

Q. In the end, what should I install first?
A. Demucs for multi-stem separation, UVR5 for vocal/instrumental only. These two cover most projects. If in doubt, start with Demucs.

Q. Which is better, Demucs or UVR5 (MDX-Net)?
A. It depends on the purpose. For all-round strength when you also want drums/bass/other, choose Demucs; for vocal/instrumental separation with little residue, the MDX-Net family is strong in many situations. To aim for the highest quality, ensemble both (UVR implements this).

Q. Is Spleeter already outdated?
A. Its quality is behind the latest generation, but its speed is still top-class. It remains useful as the first stage of "prep everything fast, then run the final pass on the adopted tracks with a high-quality model."

Q. Is it OK to embed these in a commercial product?
A. The code is MIT for all of them, so it is mostly fine. But always check the weights license (Open-Unmix UMXL in particular is non-commercial) and the rights clearance of the input song separately (see the license section).

Q. I only have a CPU.
A. It works, just slowly. Demucs takes roughly 1.5× the track's duration to process, and Spleeter is relatively light. A GPU is recommended for bulk work, but a CPU is enough for small-scale verification.

Q. Should I just choose the one with the highest SDR on the benchmark?
A. That's dangerous. SDR is a relative value that depends on evaluation conditions. The iron rule is to measure on your own material with museval and then choose (see the SDR trap section).


Conclusion: tool selection goes "requirement → constraint → measurement"

There's no "one all-purpose" source-separation OSS. That's exactly why selection should be done with a framework, not by feel.

  1. Requirement: what you want to split (2-way or multi-way), whether you prioritize quality or speed.
  2. Constraint: license (the three layers of code, weights, material), budget, operations setup, data sensitivity.
  3. Measurement: narrow candidates to 2–3 and decide after comparing on your own material with museval.

Work through these in order and you avoid the "install the famous one for now and regret it later" trap. This selection step is also exactly where outside help pays off. Anyone can run a tool, but reading the requirements and constraints, choosing the best tool (or combination), and eliminating license risk is where experience translates directly into quality.

I chose the tools for the AI video-localization platform I run in production using the selection criteria described here. If you're considering technology selection, a PoC, or productionization of voice/video AI including source separation, take a look at my track record and feel free to get in touch. As a solo developer working with generative AI, I support you from decision-making through implementation: fast, cheap, and safe.


Sources / official resources

  • Licenses, quality, and pricing get updated. For commercial use, always confirm primary sources (especially the weights license). The numbers here are rules of thumb based on official information at writing time.
友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
