# Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production

> A guide, faithful to the official papers, to the current SOTA in source separation: BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. It covers what you need for production: why the quality is the highest (band-split × RoPE Transformer), the track record (1st place at SDX23, 9.80 dB on MUSDB18HQ), runnable code with audio-separator, the reality of VRAM and speed with OOM countermeasures, and how to choose between RoFormer and MDX-Net.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: BS-RoFormer, Mel-Band RoFormer, source separation, vocal extraction, AI audio, Python, generative AI, MLOps
- URL: https://tomodahinata.com/en/blog/bs-roformer-mel-band-roformer-vocal-separation-guide
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- BS-RoFormer (ByteDance, arXiv:2309.02612, ICASSP2024) is source separation's current SOTA, band-splitting the complex spectrogram and modeling inner-band/inter-band with a hierarchical Transformer and RoPE.
- Its record is 1st place in SDX23's standard MSS division, and even the small version reaches 9.80 dB SDR on MUSDB18HQ without extra data. It surpasses Demucs v4's 9.00–9.20 dB.
- Mel-Band RoFormer (arXiv:2310.01809) is a derivative that replaces the band split with overlapping Mel-scale bands; the paper reports that it surpasses BS-RoFormer on vocals/drums/other.
- audio-separator's default model is BS-RoFormer (model_bs_roformer_ep_317_sdr_12.9755). Just specify the filename in load_model and call it from the same API as MDX-Net and Demucs.
- The price: it is the heaviest, slowest, and most VRAM-hungry option. The practical solution is to use RoFormer for top-quality final delivery and MDX-Net for speed/cost-focused bulk processing.

---

## The goal of this article

As you research source-separation models, you always arrive at **RoFormer.** Not Demucs, not UVR5/MDX-Net, but the lineage said to have **the highest separation quality as of 2026** — that's **BS-RoFormer (Band-Split RoPE Transformer)** and **Mel-Band RoFormer.** In fact, **the default model of the standard library [audio-separator](/blog/uvr5-mdx-net-vocal-separation-production-guide) is BS-RoFormer.**

This piece explains this latest SOTA **faithfully to the official papers (arXiv)** and takes it to where you can **use it in production starting today.** When you finish reading, you'll be able to:

1. Explain to others **why RoFormer is the highest quality** (band-split × RoPE Transformer).
2. **Run BS/Mel-Band RoFormer in audio-separator** and confirm the quality on your own audio.
3. Judge **when you should use RoFormer and when to stay with MDX-Net**, given the reality of VRAM and speed.

> **About the author (reliability disclosure)**: I have **single-handedly designed, implemented, and run in production an AI video-localization platform** that has source separation as its first stage. I operate a design that switches models per material — "MDX-Net for speed, RoFormer for top quality" — and the choices here are knowledge from **actually weighing cost against quality.** For how to choose tools overall, see the [selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter); Demucs is collected in its [dedicated guide](/blog/demucs-v4-music-source-separation-production-guide).

---

## 30-second summary (conclusion first)

| Aspect | Conclusion |
| --- | --- |
| **What model** | Source separation's **current SOTA.** Band-splits the complex spectrogram and estimates the mask with Transformer + RoPE |
| **BS-RoFormer** | Band-Split RoPE Transformer ([arXiv:2309.02612](https://arxiv.org/abs/2309.02612), ByteDance, ICASSP2024) |
| **Mel-Band RoFormer** | Replaces the band split with overlapping Mel-scale bands ([arXiv:2310.01809](https://arxiv.org/abs/2310.01809)). Surpasses BS-RoFormer on some stems |
| **Record** | **1st in SDX23's standard MSS division**, and even the small version **9.80 dB** on MUSDB18HQ without extra data |
| **Usage** | **audio-separator's default model.** Just specify the filename in `load_model` |
| **Price** | **The heaviest, slowest, most VRAM-hungry** (community knowledge) |
| **When to use** | Top-quality final delivery, the hero cut of an upload, difficult material |
| **When to stay** | Bulk batches, low VRAM, speed/cost focus → **MDX-Net** |
| **License** | audio-separator is MIT. **Confirm each model weight's distribution terms and the audio's copyright separately** |

---

## Why RoFormer is the highest quality (paper-based, made simple)

This chapter breaks down the core of RoFormer **while keeping accuracy.** If you only want the implementation, feel free to jump to [usage](#how-to-use-run-roformer-with-audio-separator).

### The starting point: see the spectrogram by "band"

The frequency-domain approach to source separation converts audio with STFT into a **complex spectrogram (time × frequency)** and estimates each component's mask. The problem is that **treating the whole frequency range uniformly looks at the thick low-frequency components and the delicate high-frequency components with the same weight.**

### BS-RoFormer's answer: band-split × hierarchical Transformer × RoPE

The abstract of BS-RoFormer ([arXiv:2309.02612](https://arxiv.org/abs/2309.02612), Lu, Wang, Kong, Hung, ByteDance SAMI) succinctly expresses the design.

> *BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE).*

It decomposes into three elements.

1. **Band-split**: projects the complex spectrogram into **per-frequency-band (subband) representations.** Band widths can vary across the spectrum to match its structure.
2. **Hierarchical Transformer**: stacked Transformers model **both inner-band and inter-band** sequences, capturing both "the temporal variation within a band" and "the relationship between bands."
3. **RoPE (Rotary Position Embedding)**: a mechanism that **gives position information to the Transformer via rotation.** It handles positional relationships stably even for long sequences and aids separation learning.

This "**structuring frequency by band and modeling inner/inter-band simultaneously with a Transformer**" is the source of quality surpassing the conventional U-Net family ([MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) and [Demucs](/blog/demucs-v4-music-source-separation-production-guide)).
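
The rotation idea behind RoPE can be illustrated in a few lines. This is a minimal, dependency-free sketch of the general RoPE formulation, not code from the BS-RoFormer paper; the function names, the `base=10000.0` constant, and the toy vectors are assumptions chosen for illustration. It shows the property that matters for attention: after rotation, the query-key score depends only on the distance between the two positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of features by a position-dependent angle."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates features in pairs"
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # earlier pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (illustrative values only).
query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.5, -0.75, 1.0]

# The same offset (2 steps apart) at two different absolute positions.
score_early = dot(rope(query, 3), rope(key, 1))
score_late = dot(rope(query, 10), rope(key, 8))
print(abs(score_early - score_late) < 1e-9)
```

Because the score is a function of the offset alone, the same rotation scheme can serve any sequence axis the Transformer attends over.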

### Mel-Band RoFormer: bringing the band-cutting closer to "human hearing"

**Mel-Band RoFormer** ([arXiv:2310.01809](https://arxiv.org/abs/2310.01809), Wang, Lu, Won) is a derivative that changes **how BS-RoFormer band-splits.** In the paper's words —

> *we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale.*

Whereas BS-RoFormer's band-split was **a heuristic non-overlapping one**, Mel-Band RoFormer splits into **overlapping bands along the Mel scale** (the Mel scale = a frequency scale close to human hearing). The Transformer+RoPE backend is shared, and **only the band-split at the entrance differs.** The paper reports it "**surpasses BS-RoFormer on vocals / drums / other**" (the abstract gives no specific dB value).
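
The difference between the two entrance schemes can be sketched as index ranges over STFT bins. This is an illustrative sketch, not either paper's exact band definition: the equal-width tiling stands in for a non-overlapping split, the triangular-filterbank-style construction (band `i` spans Mel points `i` to `i+2`) stands in for an overlapping Mel split, and `n_bins=1025`, `n_bands=60`, and the common `2595 * log10(1 + f/700)` Mel formula are assumptions.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def non_overlapping_bands(n_bins, n_bands):
    """Tile the bins with [start, end) ranges that share no bin (equal widths for simplicity)."""
    edges = [round(i * n_bins / n_bands) for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))

def mel_overlapping_bands(n_bins, n_bands, sample_rate):
    """Band i spans Mel points i..i+2, so every band shares bins with its neighbors."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # Mel-spaced points converted back to (fractional) bin positions.
    points = [mel_to_hz(mel_max * i / (n_bands + 1)) / nyquist * (n_bins - 1)
              for i in range(n_bands + 2)]
    bands = []
    for i in range(n_bands):
        start = math.floor(points[i])
        end = min(math.ceil(points[i + 2]) + 1, n_bins)
        bands.append((start, end))
    return bands

flat = non_overlapping_bands(n_bins=1025, n_bands=60)
mel = mel_overlapping_bands(n_bins=1025, n_bands=60, sample_rate=44100)
print("non-overlapping:", flat[:3])
print("mel-overlapping:", mel[:3])
```

In the Mel variant the low-frequency bands are only a few bins wide while the high-frequency bands span many more, and each bin near a boundary is seen by two bands.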

> 🧩 **Where the models come from**: these RoFormer checkpoints are often trained/distributed in [ZFTurbo's MSST framework](https://github.com/ZFTurbo/Music-Source-Separation-Training) (MIT) and the MVSep ecosystem, and can be downloaded directly from audio-separator. The RoFormer model code is widely used via lucidrains' reimplementation.

---

## Track record: the numbers of SDX23 and MUSDB18HQ

RoFormer's strength is backed by **the numbers of competitions and benchmarks** (primary sources only).

| Metric | Value | Condition | Source |
| --- | --- | --- | --- |
| **SDX23 MSS standard division** | **1st** (team SAMI-ByteDance = BS-RoFormer) | Arbitrary data allowed | [SDX23 paper](https://transactions.ismir.net/articles/10.5334/tismir.171) |
| **BS-RoFormer small version** | **9.80 dB SDR** | MUSDB18HQ, no extra data | [arXiv:2309.02612](https://arxiv.org/abs/2309.02612) |
| Reference: Demucs v4 (htdemucs) | 9.00 dB (base) / 9.20 dB (FT) | MUSDB HQ | [Demucs](https://github.com/facebookresearch/demucs) |

> ⚠️ **How to read the numbers**: depending on the use, an SDR difference of 0.x–1 dB is the kind you notice only "once it's pointed out." RoFormer is indeed the highest quality, but **whether that difference is worth the price in speed, VRAM, and cost** depends on the use. The [quality-evaluation guide](/blog/music-source-separation-quality-evaluation-sdr-museval) covers how to evaluate quality objectively, and the [selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter) covers cross-model selection.
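
To get a feel for what a gap such as 9.80 dB versus 9.00 dB means, it helps to convert dB back into an energy ratio. The sketch below uses the plain energy-ratio definition of SDR on a single signal; benchmark scores are aggregated over songs and stems with their own conventions, so treat the result as a rough reading rather than an exact statement about either model. The function names are illustrative.

```python
import math

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

def residual_energy_ratio(sdr_a_db, sdr_b_db):
    """Error energy of system A relative to system B for the same reference signal."""
    return 10.0 ** ((sdr_b_db - sdr_a_db) / 10.0)

ratio = residual_energy_ratio(9.80, 9.00)
print(f"{ratio:.3f}")  # 0.832, i.e. roughly 17% less residual energy
```

Under this reading, a 0.8 dB advantage corresponds to leaving about one sixth less error energy in the stem, which is why the difference can be real yet hard to hear on easy material.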

---

## How to use: run RoFormer with audio-separator

The best news is that **RoFormer can be called from the same `Separator` API as [MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) and [Demucs](/blog/demucs-v4-music-source-separation-production-guide).** What's more, since **audio-separator's default model is BS-RoFormer**, RoFormer is used automatically if you don't specify a model.

```bash
pip install "audio-separator[gpu]"   # GPU build; use [cpu] for a CPU-only install
```

```python
# roformer_quickstart.py: top-quality separation with BS-RoFormer
from audio_separator.separator import Separator

separator = Separator(output_dir="stems", output_format="flac")  # lossless, so quality is preserved

# BS-RoFormer is the default; pass the filename to make it explicit
separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")

output_files = separator.separate("song.wav")
print(output_files)   # ['..._(Vocals)_....flac', '..._(Instrumental)_....flac']
```

Representative RoFormer models (**always confirm what exists with `audio-separator --list_models`**):

| Model filename | Lineage | Use |
| --- | --- | --- |
| `model_bs_roformer_ep_317_sdr_12.9755.ckpt` | BS-RoFormer | **audio-separator default.** General-purpose, high quality |
| `model_bs_roformer_ep_368_sdr_12.9628.ckpt` | BS-RoFormer | A high-quality version of a different epoch |
| `vocals_mel_band_roformer.ckpt` | Mel-Band RoFormer | Vocal extraction |
| `mel_band_roformer_kim_ft_unwa.ckpt` | Mel-Band RoFormer | A vocal-specialized FT version |

```bash
# Check the exact RoFormer filenames available right now
audio-separator --list_models --list_filter roformer
```

> 🔧 Model names (especially the long ones with epoch/SDR) get updated in distribution. The names here are confirmed at writing time, but **confirm what exists with `--list_models`.** For RoFormer's fine inference parameters, the iron rule is to first run **with the official defaults as-is** and adjust only the points you're dissatisfied with.
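
Since only the filename changes between models, comparing BS-RoFormer and Mel-Band RoFormer on the same audio takes little more than a loop. The sketch below uses only the `Separator`, `load_model`, and `separate` calls shown in the quickstart; the helper names `plan_comparison` and `run_comparison` and the per-model folder layout are hypothetical, and the filenames are the ones from the table, so confirm them against `--list_models` first.

```python
from pathlib import Path

def plan_comparison(model_filenames, base_dir="stems"):
    """Give every model its own output folder so stems never overwrite each other."""
    return [(name, str(Path(base_dir) / Path(name).stem)) for name in model_filenames]

def run_comparison(audio_path, model_filenames, base_dir="stems"):
    """Separate the same file with each model; returns {model_filename: output_files}."""
    from audio_separator.separator import Separator  # lazy import: heavy dependency

    results = {}
    for name, out_dir in plan_comparison(model_filenames, base_dir):
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        separator = Separator(output_dir=out_dir, output_format="flac")
        separator.load_model(model_filename=name)
        results[name] = separator.separate(audio_path)
    return results

candidates = [
    "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
    "vocals_mel_band_roformer.ckpt",
]
for name, out_dir in plan_comparison(candidates):
    print(name, "->", out_dir)

# Actual run (needs audio-separator installed; downloads the weights on first use):
# results = run_comparison("song.wav", candidates)
```

Keeping one folder per model makes it easy to listen to the stems side by side or feed them to an objective evaluation afterwards.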

---

## The price: the reality of VRAM and speed

RoFormer is **not a fast, light silver bullet.** The following is community/practical knowledge (🟡 no unified official benchmark exists).

| Aspect | MDX-Net | Demucs v4 | **BS/Mel-RoFormer** |
| --- | --- | --- | --- |
| Quality | Mid–high | High | **Highest** |
| Speed | Fast | Mid | **Slow** |
| VRAM | Low | Mid | **High** |

Because it runs a large Transformer across band × time, it's **the most computationally heavy, eats the most VRAM, and has the longest processing time** — that's the price of quality. RoFormer is also the one most prone to hitting `CUDA out of memory` in long/bulk processing.

**The basic OOM countermeasure** is to lower `segment_size` (the size of the chunk loaded onto the GPU). The sticking points, including the trap where inference silently falls back from GPU to CPU and "works but is agonizingly slow," are systematized in the [troubleshooting guide](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom).
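
One way to apply this is to retry with progressively smaller chunks. The following is a sketch under stated assumptions, not a verified recipe: it assumes RoFormer checkpoints take their chunk size from the `mdxc_params` dictionary of `Separator` (with `override_model_segment_size` enabled so the value is actually used), that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory", and that the sizes `256, 128, 64` are acceptable placeholders. Check all three against the version you have installed.

```python
def separate_with_backoff(run, segment_sizes=(256, 128, 64)):
    """Call run(segment_size) with progressively smaller chunks until one fits in VRAM."""
    last_error = None
    for size in segment_sizes:
        try:
            return size, run(size)
        except RuntimeError as error:          # torch's CUDA OOM error subclasses RuntimeError
            if "out of memory" not in str(error).lower():
                raise                          # unrelated failure: do not hide it
            last_error = error
    raise RuntimeError(
        f"still out of memory at segment_size={segment_sizes[-1]}"
    ) from last_error

def run_roformer(audio_path, segment_size):
    """Assumption: RoFormer checkpoints read their chunk size from `mdxc_params`."""
    from audio_separator.separator import Separator  # lazy import: heavy dependency

    separator = Separator(
        output_dir="stems",
        mdxc_params={
            "segment_size": segment_size,
            "override_model_segment_size": True,   # assumption: needed for the value to apply
            "batch_size": 1,
            "overlap": 8,
            "pitch_shift": 0,
        },
    )
    separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")
    return separator.separate(audio_path)

# Usage (requires audio-separator and a GPU):
# size, files = separate_with_backoff(lambda s: run_roformer("song.wav", s))
```

Passing the separation step in as a callable keeps the retry logic independent of the library, so the same wrapper can be exercised in tests with a stub before it ever touches a GPU.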

---

## When to use RoFormer and when to stay with MDX-Net

"Just always use the highest quality" is, in production, **wrong from the standpoint of cost and throughput.** Use them differently by purpose.

**When you should use RoFormer:**

- **Top-quality final delivery** (the hero cut of an upload, clean separation of commercial audio).
- **Difficult material** (dense mixes, songs where vocals and instruments overlap in close bands).
- When you **can tolerate** the processing time, VRAM, and cost.

**When you should stay with MDX-Net:**

- When you want to **drive down the unit cost of bulk batches** (speed and VRAM efficiency matter).
- When you only have a **low-VRAM environment.**
- When the quality requirement is "sufficient" and **need not be the highest.**

In practice, a **two-tier setup** works: **draft everything quickly with MDX-Net to get a full overview, then re-run only the adopted cuts with RoFormer for the final output.** This balances quality and cost. For cross-model decision-making, see the [selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter), and for production scale, the [GPU-worker platform article](/blog/music-source-separation-production-api-gpu-worker-queue).
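
The routing rule itself can be very small. The sketch below is one hypothetical way to encode it: the `Job` fields, the `choose_model` helper, and the 8 GB VRAM threshold are all assumptions for illustration, while the two filenames are the ones used elsewhere in this article.

```python
from dataclasses import dataclass

FAST_MODEL = "UVR-MDX-NET-Inst_HQ_3.onnx"                     # speed/cost tier
QUALITY_MODEL = "model_bs_roformer_ep_317_sdr_12.9755.ckpt"   # top-quality tier

@dataclass
class Job:
    audio_path: str
    is_final_delivery: bool   # an adopted cut headed for release
    free_vram_gb: float       # measured on the worker that will run the job

def choose_model(job, min_vram_gb=8.0):
    """Use RoFormer only for final deliveries on workers with enough VRAM.

    The threshold is a placeholder; measure on your own hardware.
    """
    if job.is_final_delivery and job.free_vram_gb >= min_vram_gb:
        return QUALITY_MODEL
    return FAST_MODEL

jobs = [
    Job("draft_01.wav", is_final_delivery=False, free_vram_gb=24.0),
    Job("hero_cut.wav", is_final_delivery=True, free_vram_gb=24.0),
    Job("hero_cut.wav", is_final_delivery=True, free_vram_gb=4.0),
]
for job in jobs:
    print(job.audio_path, "->", choose_model(job))
```

In this sketch a final delivery on a small GPU quietly falls back to the fast model; in a real queue you may prefer to reschedule such a job onto a larger worker instead of lowering quality.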

---

## Frequently asked questions (FAQ)

**Q. BS-RoFormer or Mel-Band RoFormer — which to use?**
A. Start with **BS-RoFormer** (audio-separator's default, with a rich track record). Since the paper reports that **Mel-Band RoFormer surpasses BS-RoFormer on vocals/drums/other**, if you want to push vocal-extraction quality further, also try the Mel-Band version and decide by **[objectively evaluating](/blog/music-source-separation-quality-evaluation-sdr-museval) on your own audio.**

**Q. Is RoFormer always better than MDX-Net or Demucs?**
A. **It is ahead in quality (SDR)** but **at a disadvantage in speed and VRAM.** Whether a 0.x–1 dB difference is worth the cost depends on the use. Use MDX-Net for bulk, RoFormer for top quality, and Demucs for 4 stems.

**Q. How much GPU is needed?**
A. Since RoFormer eats VRAM, **the larger the GPU's VRAM, the more stable.** If insufficient, lower `segment_size`. Handling `CUDA out of memory` is in the [troubleshooting guide](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom).

**Q. Can it be used commercially?**
A. **audio-separator is MIT**, but **each model weight's distribution terms are individual**, and above all **the copyright of the audio you process (commercial songs, etc.) is a separate matter.** For commercial use, always confirm with primary sources.

**Q. Have I been using RoFormer without knowing it, since it's the default model?**
A. Yes, **audio-separator uses BS-RoFormer** (`model_bs_roformer_ep_317_sdr_12.9755`) when no model is specified. If you want it lighter, explicitly specify MDX-Net like `load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx")`.

---

## Conclusion: leverage the highest quality at "the right spot"

RoFormer's essence is in **structuring frequency by band and modeling inner-band/inter-band simultaneously with a Transformer and RoPE.** That's exactly why it's the current SOTA, and exactly why it's heavy.

1. **If you need quality, RoFormer**: backed by 1st place at SDX23 and 9.80 dB on MUSDB18HQ.
2. **Implementation is easy**: audio-separator's default model. Just specify the filename in `load_model`.
3. **Understand the price**: the slowest, most VRAM-hungry. Handle OOM with `segment_size`.
4. **Using them differently is the right answer**: MDX-Net for bulk, RoFormer for top quality, Demucs for 4 stems.

> "**Knowing the highest-quality model**" and "**using the optimal model differently by purpose, getting quality while keeping cost down in production**" are different things. The latter is where outsourcing makes a difference. For building and improving voice/video AI including source separation, consult me along with my [track record](/case-studies/ai-video-localization-lipsync). With **one person × generative AI**, I support end-to-end from design to production operation.

---

## Sources / official resources

- **BS-RoFormer**: [arXiv:2309.02612](https://arxiv.org/abs/2309.02612) (Music Source Separation with Band-Split RoPE Transformer, ByteDance SAMI, ICASSP 2024)
- **Mel-Band RoFormer**: [arXiv:2310.01809](https://arxiv.org/abs/2310.01809) (Mel-Band RoFormer for Music Source Separation)
- **SDX23 (Sound Demixing Challenge 2023)**: [AIcrowd](https://www.aicrowd.com/challenges/sound-demixing-challenge-2023) / [TISMIR paper](https://transactions.ismir.net/articles/10.5334/tismir.171)
- **Training/inference framework**: [ZFTurbo/Music-Source-Separation-Training (MIT)](https://github.com/ZFTurbo/Music-Source-Separation-Training)
- **Execution library**: [nomadkaraoke/python-audio-separator (MIT)](https://github.com/nomadkaraoke/python-audio-separator)

* Model names, SDR, and licenses get updated. **Always confirm primary sources before implementation** (especially model names with `audio-separator --list_models`). The speed/VRAM trade-off is community knowledge; measuring on your own hardware is recommended.
