Complete guide to making karaoke tracks and a cappella with UVR5: instrumental extraction / vocal extraction / harmony removal

The goal of this article

"I want to extract just the instrumental from a song to make karaoke," "I want to pull just the vocals for a cappella or remix material," "I want to remove the harmony (backing chorus) and keep only the lead vocal": all of these can be done with UVR5 (Ultimate Vocal Remover) + MDX-Net.

This is not a technical deep dive but a practical guide: even a first-timer should be able to produce a track today. By the end, you'll be able to:

- Choose the model best suited to your goal (karaoke / a cappella / harmony removal).
- Run separation both in the GUI (easy) and in code (bulk/automation).
- Apply tips that raise audio quality, and avoid the copyright pitfalls that are easy to miss.

About the author: I have single-handedly designed, implemented, and run in production an AI video/audio-processing platform whose first stage is source separation. The per-goal model picks and quality tips here come from actually processing a large volume of audio. The underlying mechanism is covered in the UVR5/MDX-Net guide; if you are aiming for the best quality, see the RoFormer guide.

30-second summary (goal → model to use)

What you want to make	Model to use (type)	One line
Karaoke track (instrumental only)	`UVR-MDX-NET-Inst_HQ_3` (Inst-type)	The standard for keeping the instrumental clean
A cappella (vocals only)	`Kim_Vocal_2` (Vocal-type)	Fast, high-quality vocal extraction
Harmony/chorus removal (lead voice only)	`UVR_MDXNET_KARA_2` (Karaoke-type)	Separates main and harmony
Just the best quality	`model_bs_roformer_ep_317_...` (RoFormer)	Heavy but the current best quality
Want to split drums/bass too	`htdemucs` (Demucs v4)	4-stem separation

Try it quickly → UVR5 GUI (drag & drop)
Bulk/automation → audio-separator (code/CLI)
Dial in quality → wav/flac output, ensemble, objective evaluation

💡 Model filenames get updated. Confirm what exists with audio-separator --list_models. For the meaning of the types and more detailed selection, see the selection guide.
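The table above can also live in a script as a small lookup, so the model is picked by goal rather than typed by hand. This is a minimal sketch: the goal keys (`karaoke`, `acapella`, `lead_vocal`) and the `pick_model` helper are names invented here for illustration, and only the three MDX-Net models whose filenames this article spells out in full are included.

```python
from typing import Optional, Tuple

# Goal -> (model filename, stem to keep), taken from the table above.
# The goal keys and pick_model() are hypothetical names for illustration.
MODEL_BY_GOAL = {
    "karaoke": ("UVR-MDX-NET-Inst_HQ_3.onnx", "Instrumental"),
    "acapella": ("Kim_Vocal_2.onnx", "Vocals"),
    "lead_vocal": ("UVR_MDXNET_KARA_2.onnx", None),  # keep both outputs
}


def pick_model(goal: str) -> Tuple[str, Optional[str]]:
    """Return (model_filename, single_stem) for a goal; raise on a typo."""
    try:
        return MODEL_BY_GOAL[goal.lower()]
    except KeyError:
        known = ", ".join(sorted(MODEL_BY_GOAL))
        raise ValueError(f"unknown goal {goal!r}; expected one of: {known}") from None


model, stem = pick_model("karaoke")
print(model, stem)
```

Because filenames change between releases, it is worth checking each entry against `audio-separator --list_models` before relying on it.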

First, the mechanism in one line

UVR5 is a tool that separates one song into "vocals (voice)" and "instrumental (accompaniment)." In other words:

- Extract the Instrumental → a karaoke track
- Extract the Vocals → a cappella

Models come in "those good at keeping the instrumental clean (Inst-type)" and "those good at cleanly pulling the voice (Vocal-type)," and choosing the one matching what you want is the first step to quality.

Method A: easily with the UVR5 GUI (no code)

For just one song, or to try first, the GUI is the shortest path.

1. Get the installer matching your OS (Windows / macOS / Linux) from the UVR5 official releases.
2. Launch it and select MDX-Net in Process Method.
3. Choose the model by goal:
   - Karaoke → UVR-MDX-NET-Inst_HQ_3
   - A cappella → Kim_Vocal_2
   - Harmony removal → UVR_MDXNET_KARA_2
4. Drag & drop the song file, specify the output destination, and click Start Processing.
5. The output is two files, (Instrumental) and (Vocals).

The GUI is easy, but it is manual work, one song at a time. For whole albums or recurring jobs, the code version below is far faster.

Method B: with audio-separator (bulk/automation)

With audio-separator (MIT), you can run the same models as UVR5 from code or the CLI.

pip install "audio-separator[gpu]"   # GPU build; use [cpu] for CPU/Mac

Karaoke track (instrumental only)

# Output the instrumental only (--single_stem Instrumental writes just that stem)
audio-separator song.wav \
  --model_filename UVR-MDX-NET-Inst_HQ_3.onnx \
  --single_stem Instrumental \
  --output_format flac

A cappella (vocals only)

# acapella.py: extract the vocals only
from audio_separator.separator import Separator

sep = Separator(output_dir="out", output_format="flac",
                output_single_stem="Vocals")   # write the voice only
sep.load_model(model_filename="Kim_Vocal_2.onnx")
print(sep.separate("song.wav"))                # -> list of output files (the vocal track)

Harmony/chorus removal (lead vocal only)

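# Split off the harmony (backing chorus) and keep the lead vocal
from audio_separator.separator import Separator

sep = Separator(output_dir="out", output_format="flac")
sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
print(sep.separate("acapella.wav"))   # separated into lead / harmony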
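The `acapella.wav` input above implies a two-stage flow: extract the a cappella first, then run the KARA model on it. The sketch below chains the two stages under some assumptions. `find_stem` and `extract_lead_vocal` are hypothetical helpers written for this article, output names are assumed to contain the stem in parentheses such as `(Vocals)`, and `separate()` may return bare file names in some versions, which is why the helper falls back to the output directory.

```python
from pathlib import Path


def find_stem(outputs, stem, output_dir="."):
    """Return the first output whose file name contains '(<stem>)'."""
    tag = f"({stem})".lower()
    for name in outputs:
        path = Path(name)
        if tag in path.name.lower():
            # Assumption: some versions return bare names, so fall back to output_dir.
            return path if path.exists() else Path(output_dir) / path.name
    raise FileNotFoundError(f"no {stem} stem among outputs: {list(outputs)}")


def extract_lead_vocal(song, output_dir="out"):
    """Stage 1: song -> a cappella. Stage 2: a cappella -> KARA model outputs."""
    from audio_separator.separator import Separator  # imported lazily

    sep = Separator(output_dir=output_dir, output_format="flac")
    sep.load_model(model_filename="Kim_Vocal_2.onnx")
    acapella = find_stem(sep.separate(song), "Vocals", output_dir)

    sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
    return sep.separate(str(acapella))  # listen to both files to identify the lead
```

Which of the two KARA outputs holds the lead vocal is not assumed here; check by ear on the first run.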

🔧 Use wav or flac (lossless) as the output format; mp3 is lossy and degrades the result. Note that the CLI default is FLAC while the Python library default is WAV, so the two differ (setting the format explicitly is the safe choice).

Batch-process an album

Processing one song at a time stops scaling as the count grows. Process every song in a folder in one run instead.

# batch_karaoke.py: turn every song in a folder into karaoke (instrumental extraction)
from pathlib import Path
from audio_separator.separator import Separator

sep = Separator(output_dir="karaoke", output_format="flac",
                output_single_stem="Instrumental")
sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")  # load the model only once

for track in sorted(Path("album").glob("*.wav")):
    print("processing:", track.name)
    sep.separate(str(track))      # reusing the same Separator keeps it fast

The point is to load the model once and reuse it (loading is heavy). Even hundreds of songs can run overnight this way. For full production scale (thousands of songs, queue-driven), see the GPU-worker platform article.
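An overnight run is easier to live with if it can be restarted without redoing finished songs. The following is a sketch of one way to do that, not a feature of audio-separator. `pending_tracks` and `run_batch` are hypothetical helpers, the extension set is an assumption, and the skip check assumes output names begin with the input's stem followed by `_(`, as in `song_(Instrumental)_....flac`.

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3", ".m4a"}  # assumed set; adjust to your library


def pending_tracks(album_dir, out_dir):
    """Yield input tracks that do not yet have an output file in out_dir."""
    out = Path(out_dir)
    done = [p.name for p in out.iterdir()] if out.is_dir() else []
    for track in sorted(Path(album_dir).iterdir()):
        if track.suffix.lower() not in AUDIO_EXTS:
            continue  # skip cover art, notes, etc.
        # Assumption: outputs are named "<input stem>_(<Stem>)_<model>.<ext>".
        if any(name.startswith(f"{track.stem}_(") for name in done):
            continue  # already processed in an earlier run
        yield track


def run_batch(album_dir="album", out_dir="karaoke"):
    """Separate only the tracks still pending, loading the model once."""
    from audio_separator.separator import Separator  # imported lazily

    todo = list(pending_tracks(album_dir, out_dir))
    if not todo:
        return
    sep = Separator(output_dir=out_dir, output_format="flac",
                    output_single_stem="Instrumental")
    sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")
    for track in todo:
        print("processing:", track.name)
        sep.separate(str(track))
```

If the naming pattern in your installed version differs, only the `startswith` check needs to change.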

Five tips to raise audio quality

Choose the model matching the goal: Inst-type for karaoke, Vocal-type for a cappella. Using the opposite drops quality.
Output in a lossless format: wav/flac. mp3 degrades further after separation.
If you need the best quality, RoFormer: model_bs_roformer_... is the current best quality (heavy). Details in the RoFormer guide.
Ensemble: combining a vocal-specialized and an instrumental-specialized model reduces residue (takes more time).
Adjust for hard songs: when vocals bleed into the instrumental or noise appears in it, revisit segment_size and overlap. If you get stuck, see the troubleshooting guide.
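For the last tip, here is a hedged sketch of passing segment_size and overlap from Python. It assumes the `Separator` constructor accepts an `mdx_params` dictionary with the keys shown, and that the other values match the library defaults; both are my reading of the audio-separator README, so verify them against your installed version. `mdx_params()` and `make_separator()` are hypothetical helpers.

```python
def mdx_params(segment_size=256, overlap=0.25):
    """Build the MDX settings dict; the other keys are assumed library defaults."""
    if not 0 < overlap < 1:
        raise ValueError("overlap must be between 0 and 1 (exclusive)")
    return {
        "hop_length": 1024,
        "segment_size": segment_size,  # larger uses more memory
        "overlap": overlap,            # higher is slower but may reduce artifacts
        "batch_size": 1,
        "enable_denoise": False,
    }


def make_separator(output_dir="out", **tuning):
    """Create a Separator with tuned MDX settings (assumes the mdx_params keyword)."""
    from audio_separator.separator import Separator  # imported lazily

    return Separator(output_dir=output_dir, output_format="flac",
                     mdx_params=mdx_params(**tuning))


print(mdx_params(overlap=0.5))
```

Change one value at a time and compare against the previous output, otherwise it is hard to tell which setting helped.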

Rather than judging "did it get better" by ear alone, confirming with numbers is more stable. Objective evaluation with SDR, etc., is in the quality-evaluation guide.
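As a minimal illustration of what "confirming with numbers" means, the sketch below computes the basic signal-to-distortion ratio, 10·log10(‖reference‖² / ‖reference − estimate‖²), in plain Python. It assumes you have a ground-truth stem to compare against (for example, from a multitrack you own), and it is the simple definition rather than a full evaluation toolkit. `sdr_db` is a hypothetical helper name.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB between a ground-truth stem and a separated stem."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if distortion == 0:
        return math.inf   # perfect reconstruction
    if signal == 0:
        return -math.inf  # silent reference, nothing to recover
    return 10 * math.log10(signal / distortion)


# Toy samples: the estimate is the reference scaled to 90% amplitude.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))
```

Higher is better. Comparing two models on the same song with the same reference is more meaningful than reading a single absolute value.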

⚠️ Most important: copyright cautions

Being able to separate a track technically and being legally allowed to use the result are separate matters. Get this wrong and your carefully made work can end in takedowns or disputes.

Unauthorized publication/distribution of karaoke tracks or a cappella obtained by separating commercial songs can be copyright infringement. The basic rule is to keep it within personal use.
When posting "singing" or "cover" videos, always confirm the scope of the music license each platform (YouTube, various streaming services, etc.) holds. Use of the master (the recording itself) involves the issue of master rights separately from the rights to the composition.
Your own songs, licensed material (within its license terms), and public-domain works can be used freely.
The UVR5 and audio-separator software itself is MIT-licensed, but that is unrelated to the rights of the audio you process.

When in doubt, use the criterion "is it a song I made myself, or clearly licensed material?" If commercial use or publication is involved, confirm the rights situation with primary sources.

Frequently asked questions (FAQ)

Q. Is the same model fine for both karaoke and a cappella? A. It's recommended to change it to match the goal. Karaoke (instrumental) is Inst-type (UVR-MDX-NET-Inst_HQ_3), a cappella (voice) is Vocal-type (Kim_Vocal_2). A model optimized for the stem you want gives higher quality for that stem.

Q. Some vocals remain in the instrumental. A. Try a stronger model (RoFormer), ensembling, or a higher overlap. The harder the mix, the more a high-quality model's advantage shows.

Q. I want to remove just the harmony (keep the main vocal). A. UVR_MDXNET_KARA_2 (Karaoke-type) separates main and harmony/chorus. Extracting the a cappella first, then applying the KARA type, makes it easier to separate.

Q. Is it impossible without a GPU? A. It runs on CPU, just slowly (install the [cpu] version). Apple Silicon Macs have no CUDA, so they also use the [cpu] install. An NVIDIA GPU makes bulk processing comfortable. If you get stuck, see the troubleshooting guide.

Q. Can I upload a karaoke track I made to YouTube? A. Be careful if it's based on a commercial song. Unauthorized use of the master can be a rights infringement. Confirm the scope of each platform's music license, and use your own songs/licensed material to be safe.

Conclusion: choose the model by goal, balancing quality and rights

Making karaoke/a cappella is 90% "choosing the model matching the goal."

Karaoke (instrumental) → Inst-type, a cappella (voice) → Vocal-type, harmony removal → KARA-type, best quality → RoFormer.
Easy is the GUI; bulk is an audio-separator batch.
Output in wav/flac, and dial in with ensemble/objective evaluation if needed.
Always check copyright: stay within your own songs, licensed material, or personal use.

If you want to build voice/video AI that includes source separation at production quality as a business (automatic karaoke generation, bulk processing for streaming, and so on), feel free to contact me; I can share my track record. Working solo with generative AI, I provide end-to-end support from planning through production operation.

Sources / official resources

UVR5 itself: Anjok07/ultimatevocalremovergui
Library to use in code: nomadkaraoke/python-audio-separator (MIT) / PyPI
Details of model selection: this blog, how to choose a source-separation tool
Best-quality models: this blog, BS-RoFormer/Mel-Band RoFormer guide

Model names and defaults get updated. Confirm what exists with audio-separator --list_models. Copyright is handled differently by country and platform. For commercial use/publication, always confirm primary sources and each service's terms.
