The goal of this article
"I want to extract just the instrumental from a song to make karaoke," "I want to pull just the vocals for a cappella/remix material," "I want to remove the harmony (chorus) and leave only the lead vocal" — all of these can be achieved with UVR5 (Ultimate Vocal Remover) + MDX-Net.
This piece isn't a technical deep-dive but a practical guide that "even a first-timer can actually make today." When you finish reading, you'll be able to:
- Choose the model best suited to your goal (karaoke / a cappella / harmony removal).
- Run separation in both the GUI (easy) and code (bulk/automation).
- Grasp tips to raise audio quality and the easy-to-miss copyright cautions.
About the author: I have single-handedly designed, implemented, and run in production an AI video/audio-processing platform that has source separation as its first stage. The "models per use" and "quality tips" here are knowledge I've verified by actually processing a large volume of audio. The technical mechanism is collected in the UVR5/MDX-Net guide, and for aiming at the best quality, the RoFormer guide.
30-second summary (goal → model to use)
| What you want to make | Model to use (type) | One line |
|---|---|---|
| Karaoke track (instrumental only) | UVR-MDX-NET-Inst_HQ_3 (Inst-type) | The standard for keeping the instrumental clean |
| A cappella (vocals only) | Kim_Vocal_2 (Vocal-type) | Fast, high-quality vocal extraction |
| Harmony/chorus removal (lead voice only) | UVR_MDXNET_KARA_2 (Karaoke-type) | Separates main and harmony |
| Just the best quality | model_bs_roformer_ep_317_... (RoFormer) | Heavy but the current best quality |
| Want to split drums/bass too | htdemucs (Demucs v4) | 4-stem separation |
- Try easily → UVR5 GUI (drag & drop)
- Bulk/automation → audio-separator (code/CLI)
- Dial in quality → wav/flac output, ensemble, objective evaluation
💡 Model filenames get updated. Confirm what is available with audio-separator --list_models. For the meaning of the types and more detailed selection, see the selection guide.
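If you script your workflow, the table above can be kept as a small lookup so the model choice lives in one place. This is only a sketch: the goal labels (`karaoke`, `acapella`, `harmony_removal`) are names I made up for illustration, and the filenames are the ones used in the code samples in this article, which may change as models are updated. The RoFormer entry is left out because its filename is abbreviated in the table.

```python
# Goal-to-model lookup mirroring the summary table.
# The goal keys are hypothetical labels chosen for this sketch.
MODEL_BY_GOAL = {
    "karaoke": "UVR-MDX-NET-Inst_HQ_3.onnx",       # Inst-type
    "acapella": "Kim_Vocal_2.onnx",                # Vocal-type
    "harmony_removal": "UVR_MDXNET_KARA_2.onnx",   # Karaoke-type
}


def model_for(goal: str) -> str:
    """Return the model filename for a goal, or raise listing the valid goals."""
    try:
        return MODEL_BY_GOAL[goal]
    except KeyError:
        valid = ", ".join(sorted(MODEL_BY_GOAL))
        raise ValueError(f"unknown goal {goal!r}; choose from: {valid}") from None


if __name__ == "__main__":
    print(model_for("karaoke"))
```

Before relying on a filename, it is worth checking it against the output of audio-separator --list_models.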
First, the mechanism in one line
UVR5 is a tool that separates one song into "vocals (voice)" and "instrumental (accompaniment)." That is:
- Extract the Instrumental → a karaoke track
- Extract the Vocals → a cappella
Models come in "those good at keeping the instrumental clean (Inst-type)" and "those good at cleanly pulling the voice (Vocal-type)," and choosing the one matching what you want is the first step to quality.
Method A: easily with the UVR5 GUI (no code)
For just one song, or to try first, the GUI is the shortest path.
- Get the installer matching your OS (Windows / macOS / Linux) from the UVR5 official releases.
- Launch it and select MDX-Net in Process Method.
- Choose the model by goal:
  - Karaoke → UVR-MDX-NET-Inst_HQ_3
  - A cappella → Kim_Vocal_2
  - Harmony removal → UVR_MDXNET_KARA_2
- Drag & drop the song file, specify the output destination, and click Start Processing.
- The output is two files, (Instrumental) and (Vocals).
The GUI is easy, but it is manual work, one song at a time. For album batches or periodic processing, the code version below is far faster.
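Once you have processed a handful of songs in the GUI, the output folder fills up with (Instrumental) and (Vocals) files mixed together. A minimal sketch for sorting them by the stem marker in the filename follows. It assumes only that the marker text appears somewhere in the name; the example filenames are hypothetical.

```python
from pathlib import Path


def split_by_stem(paths):
    """Group file paths by the stem marker found in their file names."""
    groups = {"Instrumental": [], "Vocals": [], "other": []}
    for p in map(Path, paths):
        if "(Instrumental)" in p.name:
            groups["Instrumental"].append(p)
        elif "(Vocals)" in p.name:
            groups["Vocals"].append(p)
        else:
            groups["other"].append(p)  # anything without a known marker
    return groups


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    found = split_by_stem(["out/song_(Instrumental).wav", "out/song_(Vocals).wav"])
    for stem, files in found.items():
        print(stem, [f.name for f in files])
```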
Method B: with audio-separator (bulk/automation)
With audio-separator (MIT), you can run the same models as UVR5 from code or the CLI.
pip install "audio-separator[gpu]"  # GPU; use [cpu] for CPU/Mac
Karaoke track (instrumental only)
# Output only the instrumental (--single_stem Instrumental writes just that stem)
audio-separator song.wav \
  --model_filename UVR-MDX-NET-Inst_HQ_3.onnx \
  --single_stem Instrumental \
  --output_format flac
A cappella (vocals only)
# acapella.py: extract only the vocals
from audio_separator.separator import Separator
sep = Separator(output_dir="out", output_format="flac",
                output_single_stem="Vocals")  # write only the vocal stem
sep.load_model(model_filename="Kim_Vocal_2.onnx")
print(sep.separate("song.wav"))  # -> the vocal track
Harmony/chorus removal (lead vocal only)
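# Split off the harmony (chorus) and keep the lead vocal
from audio_separator.separator import Separator
sep = Separator(output_dir="out", output_format="flac")
sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
print(sep.separate("acapella.wav"))  # separated into lead and harmony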
🔧 Make the output format wav or flac (lossless). mp3 degrades. Note that the CLI default is FLAC while the Python library default is WAV — they differ (being explicit is safe).
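For harmony removal, the two steps (extract the a cappella, then run the Karaoke-type model on it) can be chained in one script. The following is a sketch under a few assumptions rather than a definitive implementation: that separate() returns the list of files it wrote, that the vocal file's name contains "(Vocals)", and that the same Separator can load a second model. The helper names pick_stem and lead_vocal_pipeline are my own, not part of the library.

```python
from pathlib import Path


def pick_stem(files, stem):
    """Return the first output file whose name contains '(<stem>)', else None."""
    marker = f"({stem})"
    for f in files:
        if marker in Path(f).name:
            return f
    return None


def lead_vocal_pipeline(song, out_dir="out"):
    """Step 1: extract the a cappella. Step 2: run the Karaoke-type model on it."""
    # Imported lazily so the helper above works without the library installed.
    from audio_separator.separator import Separator

    sep = Separator(output_dir=out_dir, output_format="flac")
    sep.load_model(model_filename="Kim_Vocal_2.onnx")
    vocals = pick_stem(sep.separate(song), "Vocals")
    if vocals is None:
        raise RuntimeError("step 1 produced no file marked (Vocals)")

    # Assumption: the returned entry may be a bare file name, so fall back to out_dir.
    vocals_path = Path(vocals)
    if not vocals_path.exists():
        vocals_path = Path(out_dir) / vocals_path.name

    sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
    return sep.separate(str(vocals_path))


if __name__ == "__main__":
    print(lead_vocal_pipeline("song.wav"))
```

Both steps write flac here so that no lossy encoding happens between the first and second separation.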
Batch-process an album
"Song by song" breaks down as the count grows. Process all songs in a folder together.
# batch_karaoke.py: turn every song in a folder into karaoke (instrumental extraction)
from pathlib import Path
from audio_separator.separator import Separator
sep = Separator(output_dir="karaoke", output_format="flac",
                output_single_stem="Instrumental")
sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")  # load the model only once
for track in sorted(Path("album").glob("*.wav")):
    print("processing:", track.name)
    sep.separate(str(track))  # reusing the same Separator keeps it fast
The point is to load the model once and reuse it (loading is heavy). Even hundreds of songs can run overnight this way. For full production scale (thousands of songs, queue-driven), see the GPU-worker platform article.
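An overnight run is more forgiving if it can be restarted without redoing finished songs. The sketch below lists only the tracks that still need processing. It rests on two assumptions you should check against your own output folder: that output names begin with the input's base name followed by "_(", and that the extension set in AUDIO_EXTS covers your library. The function name pending_tracks is hypothetical.

```python
from pathlib import Path

# Assumed set of input extensions; adjust to what your album folder contains.
AUDIO_EXTS = {".wav", ".flac", ".mp3", ".m4a"}


def pending_tracks(album_dir, out_dir):
    """List audio files in album_dir that have no matching output in out_dir yet."""
    out = Path(out_dir)
    done = [p.name for p in out.iterdir()] if out.is_dir() else []
    todo = []
    for track in sorted(Path(album_dir).iterdir()):
        if track.suffix.lower() not in AUDIO_EXTS:
            continue  # skip cover art, text files, and so on
        # Assumption: outputs are named like "<input base name>_(Stem)...".
        if any(name.startswith(track.stem + "_(") for name in done):
            continue  # already processed in an earlier run
        todo.append(track)
    return todo


if __name__ == "__main__":
    for track in pending_tracks("album", "karaoke"):
        print("still to process:", track.name)
```

Feeding this list into the loop in place of the glob keeps the single load_model call while letting an interrupted run pick up where it stopped.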
Five tips to raise audio quality
- Choose the model matching the goal: Inst-type for karaoke, Vocal-type for a cappella. Using the opposite drops quality.
- Output in a lossless format: wav/flac. mp3 degrades further after separation.
- If you need the best quality, RoFormer: model_bs_roformer_... is the current best quality (heavy). Details in the RoFormer guide.
- Ensemble: combining a vocal-specialized and an instrumental-specialized model reduces residue (takes more time).
- Adjust for hard songs: when vocals remain in the instrumental or noise rides on the instrumental, revisit segment_size and overlap. If stuck, go to troubleshooting.
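For the last tip, here is a sketch of how segment_size and overlap might be passed from Python. As far as I understand the audio-separator README, the Separator constructor accepts an mdx_params dict with the keys and defaults shown below, but treat that as an assumption and confirm it against your installed version. The helper names mdx_params and separate_with are mine.

```python
def mdx_params(segment_size=256, overlap=0.25):
    """Build an mdx_params dict, checking that overlap is a fraction below 1."""
    if not 0.0 < overlap < 1.0:
        raise ValueError("overlap must be between 0 and 1 (exclusive)")
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return {
        "hop_length": 1024,
        "segment_size": segment_size,  # larger uses more memory
        "overlap": overlap,            # higher is slower
        "batch_size": 1,
        "enable_denoise": False,
    }


def separate_with(song, params, model="UVR-MDX-NET-Inst_HQ_3.onnx"):
    """Run one separation with the given MDX parameters."""
    # Imported lazily so mdx_params() works without the library installed.
    from audio_separator.separator import Separator

    sep = Separator(output_dir="out", output_format="flac", mdx_params=params)
    sep.load_model(model_filename=model)
    return sep.separate(song)


if __name__ == "__main__":
    print(separate_with("song.wav", mdx_params(overlap=0.5)))
```

Changing one parameter at a time and comparing the results on the same song makes it easier to tell which setting actually helped.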
Rather than judging "did it get better" by ear alone, confirming with numbers is more stable. Objective evaluation with SDR, etc., is in the quality-evaluation guide.
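To make "confirming with numbers" concrete, here is the basic SDR definition in plain Python: 10·log10 of the reference energy divided by the energy of the error. This is a simplified sketch, not the full BSS Eval procedure, and it needs a ground-truth stem to compare against, so it only applies to material where you already have the true vocal or instrumental (for example, your own multitrack). The function name sdr_db is hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: 10*log10(sum(ref^2) / sum((ref-est)^2))."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0:
        return math.inf   # perfect reconstruction
    if signal == 0:
        return -math.inf  # silent reference, non-silent estimate
    return 10 * math.log10(signal / error)


if __name__ == "__main__":
    ref = [1.0, -1.0, 1.0, -1.0]
    est = [0.9, -0.9, 0.9, -0.9]  # every sample is off by 0.1
    print(round(sdr_db(ref, est), 2))
```

Higher is better. In the toy example the error energy is one hundredth of the signal energy, which corresponds to 20 dB.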
⚠️ Most important: copyright cautions
Being able to separate technically and being legally allowed to use it are separate matters. Get this wrong and your carefully made work becomes a source of removal/trouble.
- Unauthorized publication/distribution of karaoke tracks or a cappella obtained by separating commercial songs can be copyright infringement. The basic rule is to keep it within personal use.
- When posting "singing" or "cover" videos, always confirm the scope of the music license each platform (YouTube, various streaming services, etc.) holds. Use of the master (the recording itself) involves the issue of master rights separately from the rights to the composition.
- Your own songs, licensed material, and public-domain works can be used freely.
- The UVR5 and audio-separator software itself is MIT-licensed, but that is unrelated to the rights of the audio you process.
When in doubt, use the criterion "is it a song I made myself, or clearly licensed material?" If commercial use or publication is involved, confirm the rights situation with primary sources.
Frequently asked questions (FAQ)
Q. Is the same model fine for both karaoke and a cappella?
A. It's recommended to change it to match the goal. Karaoke (instrumental) is Inst-type (UVR-MDX-NET-Inst_HQ_3), a cappella (voice) is Vocal-type (Kim_Vocal_2). A model optimized for the stem you want gives higher quality for that stem.
Q. Some vocals remain in the instrumental.
A. Trying a more powerful model (RoFormer), ensembling, and raising overlap all help. The harder the mix, the more the difference of a high-quality model shows.
Q. I want to remove just the harmony (keep the main vocal).
A. UVR_MDXNET_KARA_2 (Karaoke-type) separates main and harmony/chorus. Extracting the a cappella first, then applying the KARA type, makes it easier to separate.
Q. Is it impossible without a GPU?
A. It runs on CPU (the [cpu] version), but slowly. Mac (Apple Silicon) doesn't support CUDA, so CPU execution is the default. For bulk processing, an NVIDIA GPU environment is more comfortable. If you get stuck, see troubleshooting.
Q. Can I upload a karaoke track I made to YouTube?
A. Be careful if it's based on a commercial song. Unauthorized use of the master can be a rights infringement. Confirm the scope of each platform's music license, and use your own songs/licensed material to be safe.
Conclusion: choose the model by goal, balancing quality and rights
Making karaoke/a cappella is 90% "choosing the model matching the goal."
- Karaoke (instrumental) → Inst-type, a cappella (voice) → Vocal-type, harmony removal → KARA-type, best quality → RoFormer.
- Easy is the GUI; bulk is an audio-separator batch.
- Output in wav/flac, and dial in with ensemble/objective evaluation if needed.
- Always confirm copyright — within the scope of your own/licensed material/personal use.
If you want to build voice/video AI including source separation at production quality as a business (automatic karaoke generation, bulk processing for streaming, etc.), consult me along with my track record. With one person × generative AI, I support end-to-end from planning to production operation.
Sources / official resources
- UVR5 itself: Anjok07/ultimatevocalremovergui
- Library to use in code: nomadkaraoke/python-audio-separator (MIT) / PyPI
- Details of model selection: this blog, how to choose a source-separation tool
- Best-quality models: this blog, BS-RoFormer/Mel-Band RoFormer guide
- Model names and defaults get updated. Confirm what is available with audio-separator --list_models. Copyright is handled differently by country and platform. For commercial use/publication, always confirm primary sources and each service's terms.