# Complete guide to making karaoke tracks and a cappella with UVR5: instrumental extraction / vocal extraction / harmony removal

> A practical guide to making karaoke tracks (instrumentals), a cappella tracks (vocals), and harmony-free lead vocals from songs with UVR5 (MDX-Net). Written so that a first-timer can produce results today, it covers the recommended model for each use (Inst-type / Vocal-type / KARA_2), both GUI and code workflows, tips for better audio quality, batch processing of an album, and the copyright cautions that are easy to miss.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: UVR5, karaoke, vocal extraction, source separation, a cappella, MDX-Net, Python, AI audio
- URL: https://tomodahinata.com/en/blog/uvr5-karaoke-instrumental-acapella-vocal-extraction-guide
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- Choose the model by goal: Inst-type (`UVR-MDX-NET-Inst_HQ_3`) for karaoke (instrumental), Vocal-type (`Kim_Vocal_2`) for a cappella (vocals), `UVR_MDXNET_KARA_2` for harmony removal, and BS-RoFormer for the best quality.
- For a quick try, use the UVR5 GUI (drag & drop); for bulk processing and automation, use audio-separator (`pip install`). Both can run the same models.
- Quality tips: output wav/flac (mp3 degrades), ensemble a vocal-specialized model with an instrumental-specialized one, adjust `segment_size`/`overlap` for hard songs, and finish with an objective evaluation.
- Album batches can be automated with a short script. Don't process songs by hand one at a time.
- Most important caution: publishing or distributing tracks separated from commercial songs without permission can be copyright infringement. For cover or "singing" posts, check the licensed scope of each platform.

---

## The goal of this article

"I want to extract **just the instrumental** from a song to make karaoke," "I want to pull **just the vocals** for a cappella or remix material," "I want to **remove the harmony (chorus)** and leave only the lead vocal." All of these can be done with **UVR5 (Ultimate Vocal Remover) + MDX-Net.**

This piece is not a technical deep-dive but a **practical guide that lets even a first-timer produce a track today.** By the end, you'll be able to:

1. **Choose the model best suited to your goal** (karaoke / a cappella / harmony removal).
2. Run separation in **both the GUI (easy) and code (bulk/automation).**
3. Apply **tips that raise audio quality** and avoid the easy-to-miss **copyright pitfalls.**

> **About the author**: I have **single-handedly designed, implemented, and run in production an AI video/audio-processing platform** that has source separation as its first stage. The per-use models and quality tips here come from processing a large volume of audio myself. The technical details are covered in the [UVR5/MDX-Net guide](/blog/uvr5-mdx-net-vocal-separation-production-guide), and the [RoFormer guide](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide) covers how to get the best quality.

---

## 30-second summary (goal → model to use)

| What you want to make | Model to use (type) | One line |
| --- | --- | --- |
| **Karaoke track (instrumental only)** | `UVR-MDX-NET-Inst_HQ_3` (Inst-type) | The standard for keeping the instrumental clean |
| **A cappella (vocals only)** | `Kim_Vocal_2` (Vocal-type) | Fast, high-quality vocal extraction |
| **Harmony/chorus removal (lead voice only)** | `UVR_MDXNET_KARA_2` (Karaoke-type) | Separates main and harmony |
| **Just the best quality** | `model_bs_roformer_ep_317_...` (RoFormer) | Heavy but the current best quality |
| **Want to split drums/bass too** | `htdemucs` (Demucs v4) | 4-stem separation |

- **Try easily** → UVR5 GUI (drag & drop)
- **Bulk/automation** → `audio-separator` (code/CLI)
- **Dial in quality** → wav/flac output, ensemble, objective evaluation

> 💡 Model filenames get updated. Confirm what exists with `audio-separator --list_models`. For the meaning of the types and more detailed selection, see the [selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter).
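
The table above can also be written as a small lookup so that scripts pick the model from the goal. This is a minimal sketch: the goal keys and the `pick_model` helper are hypothetical names for illustration, and the filenames are the `.onnx` names used in the code sections of this article, which may change between releases.

```python
from typing import Optional, Tuple

# Hypothetical goal -> (model filename, single stem to write) lookup.
# Filenames are the ones used in this article; confirm them with
# `audio-separator --list_models` because they change over time.
MODEL_BY_GOAL = {
    "karaoke": ("UVR-MDX-NET-Inst_HQ_3.onnx", "Instrumental"),
    "acapella": ("Kim_Vocal_2.onnx", "Vocals"),
    "harmony_removal": ("UVR_MDXNET_KARA_2.onnx", None),  # keep both outputs
}


def pick_model(goal: str) -> Tuple[str, Optional[str]]:
    """Return (model_filename, single_stem) for a goal, or raise ValueError."""
    try:
        return MODEL_BY_GOAL[goal]
    except KeyError:
        choices = ", ".join(sorted(MODEL_BY_GOAL))
        raise ValueError(f"unknown goal {goal!r}; choose from: {choices}") from None


if __name__ == "__main__":
    print(pick_model("karaoke"))
```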

---

## First, the mechanism in one line

UVR5 is a tool that **separates one song into "vocals (voice)" and "instrumental (accompaniment)."** That is:

- **Extract the Instrumental → a karaoke track**
- **Extract the Vocals → a cappella**

Models come in two kinds: those good at keeping the instrumental clean (Inst-type) and those good at pulling the voice out cleanly (Vocal-type). **Choosing the one that matches what you want** is the first step toward quality.

---

## Method A: easily with the UVR5 GUI (no code)

For just one song, or to try first, the GUI is the shortest path.

1. Get the installer matching your OS (Windows / macOS / Linux) from the [UVR5 official releases](https://github.com/Anjok07/ultimatevocalremovergui/releases).
2. Launch it and select **MDX-Net** in **Process Method.**
3. Choose the **model** by goal:
   - Karaoke → `UVR-MDX-NET-Inst_HQ_3`
   - A cappella → `Kim_Vocal_2`
   - Harmony removal → `UVR_MDXNET_KARA_2`
4. **Drag & drop** the song file, specify the output destination, and click **Start Processing.**
5. The output is two files, `(Instrumental)` and `(Vocals)`.

The GUI is easy, but you **process songs by hand, one at a time.** For album batches or recurring jobs, the code approach below is far faster.

---

## Method B: with audio-separator (bulk/automation)

With [audio-separator](https://github.com/nomadkaraoke/python-audio-separator) (MIT), you can run the same models as UVR5 **from code or the CLI.**

```bash
pip install "audio-separator[gpu]"   # NVIDIA GPU (CUDA); use [cpu] for CPU or Mac
```
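
If you are unsure which extra to install, one rough heuristic is to check whether the NVIDIA driver tool `nvidia-smi` is on the PATH. The sketch below is a convenience built on that assumption, not part of audio-separator: `suggest_extra` is a hypothetical helper, and finding `nvidia-smi` does not guarantee a working CUDA setup.

```python
import shutil


def suggest_extra(which=shutil.which):
    """Return the pip extra to try: "gpu" if nvidia-smi is found, else "cpu".

    `which` is injectable so the lookup can be replaced in tests;
    by default it searches the PATH with shutil.which.
    """
    return "gpu" if which("nvidia-smi") else "cpu"


if __name__ == "__main__":
    print(f'pip install "audio-separator[{suggest_extra()}]"')
```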

### Karaoke track (instrumental only)

```bash
# Write only the instrumental (--single_stem Instrumental outputs just that stem)
audio-separator song.wav \
  --model_filename UVR-MDX-NET-Inst_HQ_3.onnx \
  --single_stem Instrumental \
  --output_format flac
```
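
To drive the same CLI from a script, the command can be assembled as an argument list and handed to `subprocess.run`. This is a minimal sketch that uses only the flags shown above; `build_command` and `run_cli` are hypothetical helper names.

```python
import subprocess


def build_command(audio_path, model_filename, single_stem=None, output_format="flac"):
    """Assemble the audio-separator CLI call as an argv list (no shell quoting needed)."""
    cmd = ["audio-separator", audio_path,
           "--model_filename", model_filename,
           "--output_format", output_format]
    if single_stem:
        cmd += ["--single_stem", single_stem]
    return cmd


def run_cli(audio_path, model_filename, **kwargs):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_command(audio_path, model_filename, **kwargs), check=True)


if __name__ == "__main__":
    print(" ".join(build_command("song.wav", "UVR-MDX-NET-Inst_HQ_3.onnx", "Instrumental")))
```

Passing a list instead of a shell string avoids quoting problems with song titles that contain spaces or parentheses.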

### A cappella (vocals only)

```python
# acapella.py: extract only the vocals
from audio_separator.separator import Separator

sep = Separator(output_dir="out", output_format="flac",
                output_single_stem="Vocals")   # write only the vocal stem
sep.load_model(model_filename="Kim_Vocal_2.onnx")
print(sep.separate("song.wav"))                # -> list of output files (the vocal track)
```

### Harmony/chorus removal (lead vocal only)

```python
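# Split the harmony (chorus) from the lead vocal
from audio_separator.separator import Separator

sep = Separator(output_dir="out", output_format="flac")
sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
print(sep.separate("acapella.wav"))   # input: the a cappella; output: main / harmony stems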
```
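
The two snippets above can be chained so that the a cappella from the first stage feeds the KARA model directly. The sketch below rests on two assumptions that are worth verifying in your version: that `separate()` returns a list of output filenames, and that those filenames embed the stem name in parentheses, as in `(Vocals)`. `pick_stem` and `lead_vocal_pipeline` are hypothetical helper names.

```python
from pathlib import Path


def pick_stem(outputs, stem, output_dir="out"):
    """Return the path of the first output whose filename contains "(stem)".

    Bare filenames are resolved against output_dir; paths that already
    include a directory are returned unchanged.
    """
    tag = f"({stem})"
    for name in outputs:
        path = Path(name)
        if tag in path.name:
            return path if path.parent != Path(".") else Path(output_dir) / path
    raise FileNotFoundError(f"no {stem} stem in {list(outputs)}")


def lead_vocal_pipeline(song, output_dir="out"):
    """Stage 1: extract the a cappella. Stage 2: run KARA_2 on it."""
    # Imported lazily so pick_stem works without the package installed.
    from audio_separator.separator import Separator

    sep = Separator(output_dir=output_dir, output_format="flac")
    sep.load_model(model_filename="Kim_Vocal_2.onnx")
    vocals = pick_stem(sep.separate(song), "Vocals", output_dir)

    sep.load_model(model_filename="UVR_MDXNET_KARA_2.onnx")
    return sep.separate(str(vocals))  # main / harmony outputs
```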

> 🔧 Use **wav or flac** (lossless) as the output format; **mp3 degrades.** Note that the defaults differ: the CLI defaults to FLAC while the Python library defaults to WAV, so set the format explicitly.

---

## Batch-process an album

Processing one song at a time stops scaling as the count grows. Process every song in a folder in one run.

```python
# batch_karaoke.py: turn every song in a folder into a karaoke track (instrumental extraction)
from pathlib import Path
from audio_separator.separator import Separator

sep = Separator(output_dir="karaoke", output_format="flac",
                output_single_stem="Instrumental")
sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")  # load the model only once

for track in sorted(Path("album").glob("*.wav")):
    print("processing:", track.name)
    sep.separate(str(track))      # reusing the same Separator keeps it fast
```

The point is to **load the model once and reuse it** (loading is heavy). Even hundreds of songs can run overnight this way. For full production scale (thousands of songs, queue-driven), see the [GPU-worker platform article](/blog/music-source-separation-production-api-gpu-worker-queue).
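
For an unattended overnight run, it helps if one unreadable file does not abort the whole batch. The sketch below extends the loop above with that in mind; `find_tracks`, `batch_karaoke`, and the extension list are hypothetical, and whether non-wav inputs decode correctly depends on your installation, so treat `AUDIO_EXTS` as an assumption to adjust.

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3", ".m4a"}  # assumed input formats


def find_tracks(folder, exts=AUDIO_EXTS):
    """Return audio files directly inside folder (extension match ignores case)."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in exts)


def batch_karaoke(folder, output_dir="karaoke"):
    """Separate every track, collecting failures instead of stopping the run."""
    from audio_separator.separator import Separator  # lazy import

    sep = Separator(output_dir=output_dir, output_format="flac",
                    output_single_stem="Instrumental")
    sep.load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")  # load once

    failed = []
    for track in find_tracks(folder):
        try:
            sep.separate(str(track))
        except Exception as exc:  # one bad file should not end the batch
            failed.append((track.name, repr(exc)))
    return failed
```

The returned list of `(filename, error)` pairs can be reviewed in the morning and the failed songs retried individually.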

---

## Five tips to raise audio quality

1. **Choose the model that matches the goal**: Inst-type for karaoke, Vocal-type for a cappella. Using the opposite type lowers quality.
2. **Output in a lossless format**: wav/flac. mp3 degrades the separated audio further.
3. **If you need the best quality, use RoFormer**: `model_bs_roformer_...` is the current best quality (heavy). Details are in the [RoFormer guide](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide).
4. **Ensemble**: combining a vocal-specialized and an instrumental-specialized model reduces residue (at the cost of more processing time).
5. **Adjust for hard songs**: when vocals bleed into the instrumental or noise rides on it, revisit `segment_size` and `overlap`. If stuck, go to [troubleshooting](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom).
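
For tip 5, audio-separator's Python API takes the MDX settings as a dictionary when the `Separator` is constructed. The sketch below assumes the `mdx_params` keyword and its five keys as found in recent versions of the library; the values and the `mdx_params_for` helper are illustrative, not tuned recommendations, so check `audio-separator --help` for the names and defaults in your version.

```python
def mdx_params_for(hard_song=False):
    """Return a full MDX settings dict; a hard song gets more window overlap.

    Overlap is a fraction between 0 and 1: higher values blend more
    overlapping windows and run slower. All values are illustrative.
    """
    params = {
        "hop_length": 1024,
        "segment_size": 256,
        "overlap": 0.25,
        "batch_size": 1,
        "enable_denoise": False,
    }
    if hard_song:
        params["overlap"] = 0.5
    return params


def make_separator(hard_song=False):
    """Build a Separator with the chosen settings (assumed `mdx_params` keyword)."""
    from audio_separator.separator import Separator  # lazy import

    return Separator(output_dir="out", output_format="flac",
                     mdx_params=mdx_params_for(hard_song))
```

Change one parameter at a time and compare the results on the same song, otherwise it is hard to tell which setting helped.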

Rather than judging "did it get better" by ear alone, **confirming with numbers** is more reliable. Objective evaluation with SDR and similar metrics is covered in the [quality-evaluation guide](/blog/music-source-separation-quality-evaluation-sdr-museval).
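
As a concrete example of a numeric check, the signal-to-distortion ratio compares an estimate with a ground-truth reference stem, so it only applies when you have the true stem (for example, from your own multitrack). The sketch below uses the plain definition, 10·log10 of reference energy over error energy, on lists of samples. It is a simplified illustration rather than the BSS Eval procedure that dedicated evaluation tools implement, and `sdr_db` is a hypothetical helper name.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(sum(s^2) / sum((s - s_hat)^2)).

    Both inputs are equal-length sequences of float samples.
    Returns math.inf when the estimate matches the reference exactly.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal = sum(s * s for s in reference)
    if signal == 0:
        raise ValueError("reference is silent; SDR is undefined")
    noise = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    ref = [1.0, 0.0, -1.0, 0.0]
    est = [0.9, 0.0, -0.9, 0.0]   # 10% amplitude error
    print(round(sdr_db(ref, est), 2))  # 20.0
```

A higher value means the estimate is closer to the reference, so comparing two models or two settings on the same song comes down to comparing two numbers.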

---

## ⚠️ Most important: copyright cautions

Being technically able to separate a song and **being legally allowed to use the result are separate matters.** Get this wrong and your carefully made work can end in takedowns or disputes.

- **Unauthorized publication/distribution of karaoke tracks or a cappella obtained by separating commercial songs can be copyright infringement.** The basic rule is to keep it within personal use.
- **When posting "singing" or "cover" videos**, always confirm the **scope of the music license** each platform (YouTube, various streaming services, etc.) holds. Use of the master (the recording itself) involves the issue of **master rights** separately from the rights to the composition.
- **Your own songs and public-domain works** can be used freely; **licensed material** can be used within the terms of its license.
- The UVR5 and audio-separator software itself is MIT-licensed, but that is **unrelated to the rights of the audio you process.**

> When in doubt, use the criterion "**is it a song I made myself, or clearly licensed material?**" If commercial use or publication is involved, confirm the rights situation with primary sources.

---

## Frequently asked questions (FAQ)

**Q. Is the same model fine for both karaoke and a cappella?**
A. It's recommended to **change it to match the goal.** Karaoke (instrumental) is **Inst-type (`UVR-MDX-NET-Inst_HQ_3`)**, a cappella (voice) is **Vocal-type (`Kim_Vocal_2`)**. A model optimized for the stem you want gives higher quality for that stem.

**Q. Some vocals remain in the instrumental.**
A. Try a stronger model ([RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)), **ensembling**, or a higher `overlap`. The harder the mix, the more a high-quality model's advantage shows.

**Q. I want to remove just the harmony (keep the main vocal).**
A. `UVR_MDXNET_KARA_2` (Karaoke-type) separates the main vocal from the harmony/chorus. Extracting the a cappella first and then applying the KARA model to it makes the separation easier.

**Q. Is it impossible without a GPU?**
A. It runs on CPU but is **slow** (the `[cpu]` version). Macs (Apple Silicon) have no CUDA, so they use the `[cpu]` install; whether Apple hardware acceleration is used depends on the audio-separator and runtime versions. For bulk processing, an NVIDIA GPU environment is the comfortable choice. If stuck, see [troubleshooting](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom).

**Q. Can I upload a karaoke track I made to YouTube?**
A. **Be careful if it's based on a commercial song.** Unauthorized use of the master can be a rights infringement. Confirm the scope of each platform's music license, and use your own songs/licensed material to be safe.

---

## Conclusion: choose the model by goal, balancing quality and rights

Making karaoke/a cappella is **90% "choosing the model matching the goal."**

1. **Karaoke (instrumental) → Inst-type, a cappella (voice) → Vocal-type, harmony removal → KARA-type, best quality → RoFormer.**
2. **For ease, use the GUI; for bulk, use an audio-separator batch.**
3. **Output in wav/flac, and refine with ensembling and objective evaluation if needed.**
4. **Always confirm copyright**: stay within your own material, licensed material, or personal use.

> If you want to build voice/video AI that includes source separation **at production quality as a business** (automatic karaoke generation, bulk processing for streaming, and so on), get in touch; my [track record](/case-studies/ai-video-localization-lipsync) shows related work. Working as **one person × generative AI**, I support projects end to end, from planning to production operation.

---

## Sources / official resources

- **UVR5 itself**: [Anjok07/ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
- **Library to use in code**: [nomadkaraoke/python-audio-separator (MIT)](https://github.com/nomadkaraoke/python-audio-separator) / [PyPI](https://pypi.org/project/audio-separator/)
- **Details of model selection**: this blog, [how to choose a source-separation tool](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)
- **Best-quality models**: this blog, [BS-RoFormer/Mel-Band RoFormer guide](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)

* Model names and defaults get updated. Confirm what exists with `audio-separator --list_models`. Copyright is handled differently by country and platform. For commercial use/publication, always confirm primary sources and each service's terms.
