The goal of this article
As you research source-separation models, you always arrive at RoFormer. Not Demucs, not UVR5/MDX-Net: the lineage said to have the highest separation quality as of 2026 is BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. In fact, the default model of the widely used audio-separator library is BS-RoFormer.
This piece explains the current SOTA while staying faithful to the original papers (arXiv), and takes you to the point where you can use it in production today. By the end, you will be able to:
- Explain to others why RoFormer is the highest quality (band-split × RoPE Transformer).
- Run BS/Mel-Band RoFormer in audio-separator and confirm the quality on your own audio.
- Judge when you should use RoFormer and when to stay with MDX-Net, given the reality of VRAM and speed.
About the author (reliability disclosure): I have single-handedly designed, implemented, and run in production an AI video-localization platform that has source separation as its first stage. I operate a design that switches models per material — "MDX-Net for speed, RoFormer for top quality" — and the choices here are knowledge from actually weighing cost against quality. For how to choose tools overall, see the selection guide; Demucs is collected in its dedicated guide.
30-second summary (conclusion first)
| Aspect | Conclusion |
|---|---|
| What model | Source separation's current SOTA. Band-splits the complex spectrogram and estimates the mask with Transformer + RoPE |
| BS-RoFormer | Band-Split RoPE Transformer (arXiv:2309.02612, ByteDance, ICASSP2024) |
| Mel-Band RoFormer | Makes band-splitting into overlapping Mel-scale bands (arXiv:2310.01809). Surpasses BS on some stems |
| Record | 1st in the standard MSS division of SDX23; even the small version reaches 9.80 dB SDR on MUSDB18HQ without extra training data |
| Usage | audio-separator's default model. Just specify the filename in load_model |
| Price | The heaviest, slowest, most VRAM-hungry (community knowledge) |
| When to use | Top-quality final delivery, the hero cut of an upload, difficult material |
| When to stay | Bulk batches, low VRAM, speed/cost focus → MDX-Net |
| License | audio-separator is MIT. Confirm each model weight's distribution terms and the audio's copyright separately |
Why RoFormer is the highest quality (paper-based, made simple)
This chapter breaks down the core of RoFormer while keeping accuracy. If you only want the implementation, feel free to jump to usage.
The starting point: see the spectrogram by "band"
The frequency-domain approach to source separation converts audio with STFT into a complex spectrogram (time × frequency) and estimates each component's mask. The problem is that treating the whole frequency range uniformly looks at the thick low-frequency components and the delicate high-frequency components with the same weight.
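To make "estimating a mask" concrete, here is a minimal sketch of the final step only: multiplying each complex spectrogram bin of the mixture by a complex mask value. The tiny 2×2 spectrogram and the mask values are invented for illustration; in a real system the mask is what the model predicts.

```python
# Toy complex spectrogram of a mixture: rows = time frames, columns = frequency bins.
# All numbers are invented for illustration only.
mixture = [
    [1 + 1j, 2 + 0j],
    [0 + 2j, 1 - 1j],
]

# A complex mask of the same shape. A separation model would predict this;
# here it is hard-coded: keep bin 0 untouched, halve bin 1.
mask = [
    [1 + 0j, 0.5 + 0j],
    [1 + 0j, 0.5 + 0j],
]

def apply_mask(spec, mask):
    """Element-wise complex multiplication of a spectrogram and a mask."""
    return [
        [s * m for s, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spec, mask)
    ]

estimated_source = apply_mask(mixture, mask)
print(estimated_source)
```

The estimated spectrogram is then converted back to a waveform with the inverse STFT, which this sketch leaves out.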
BS-RoFormer's answer: band-split × hierarchical Transformer × RoPE
The abstract of BS-RoFormer (arXiv:2309.02612, Lu, Wang, Kong, Hung, ByteDance SAMI) succinctly expresses the design:
"BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE)."
It decomposes into three elements.
- Band-split: projects the complex spectrogram into per-frequency-band (subband) representations. The granularity can be varied to match the structure of frequency.
- Hierarchical Transformer: relates both inner-band and inter-band sequences with stacked Transformers. It takes both "the temporal variation within a band" and "the relationship between bands."
- RoPE (Rotary Position Embedding): a mechanism that gives position information to the Transformer via rotation. It handles positional relationships stably even for long sequences and aids separation learning.
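The RoPE bullet above can be illustrated with a small pure-Python sketch. It rotates consecutive pairs of a vector by an angle proportional to the position, then checks the property that makes RoPE useful for attention: the dot product of a rotated query and key depends only on their relative distance. The pairing convention, the base of 10000, and the example vectors are illustrative assumptions, not values taken from the BS-RoFormer paper or any checkpoint.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative distance (2) at two different absolute positions.
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because the attention score is unchanged when both positions shift by the same amount, the model sees relative rather than absolute positions, which is one plausible reading of why it copes well with long sequences.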
This "structuring frequency by band and modeling inner/inter-band simultaneously with a Transformer" is the source of quality surpassing the conventional U-Net family (MDX-Net and Demucs).
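As a rough sketch of what "inner-band" and "inter-band" sequences mean, the snippet below takes a toy grid indexed by time frame and band and shows the two ways it is cut into sequences. String labels stand in for feature vectors, and the function names are hypothetical; the actual model applies Transformer layers to each kind of sequence in alternation.

```python
# Toy feature grid: grid[t][k] is the feature for time frame t and band k.
# Real models hold a vector per cell; a string label keeps the sketch readable.
T, K = 3, 2
grid = [[f"t{t}b{k}" for k in range(K)] for t in range(T)]

def inner_band_sequences(grid):
    """One sequence per band, running along time (temporal variation within a band)."""
    return [list(column) for column in zip(*grid)]

def inter_band_sequences(grid):
    """One sequence per time frame, running across bands (relationship between bands)."""
    return [list(row) for row in grid]

print(inner_band_sequences(grid))
print(inter_band_sequences(grid))
```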
Mel-Band RoFormer: bringing the band-cutting closer to "human hearing"
Mel-Band RoFormer (arXiv:2310.01809, Wang, Lu, Won) is a derivative that changes how BS-RoFormer splits bands. In the paper's words:
"we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale."
Whereas BS-RoFormer's band-split was a heuristic non-overlapping one, Mel-Band RoFormer splits into overlapping bands along the Mel scale (the Mel scale = a frequency scale close to human hearing). The Transformer+RoPE backend is shared, and only the band-split at the entrance differs. The paper reports it "surpasses BS-RoFormer on vocals / drums / other" (the abstract gives no specific dB value).
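The difference between the two band-split styles can be sketched in a few lines. The first function builds non-overlapping bands from a list of widths; the second builds overlapping bands whose edges are evenly spaced on the mel scale. This is a simplified illustration: the band widths, the bin count, and the way band edges are rounded are all assumptions made here, not the exact scheme of either paper or of any released checkpoint.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def non_overlapping_bands(widths):
    """Each frequency bin belongs to exactly one band."""
    bands, start = [], 0
    for width in widths:
        bands.append(list(range(start, start + width)))
        start += width
    return bands

def mel_overlapping_bands(n_bins, n_bands, sample_rate):
    """Band i spans mel points i..i+2, so neighbouring bands share bins."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    points_hz = [mel_to_hz(mel_max * i / (n_bands + 1)) for i in range(n_bands + 2)]
    points_bin = [hz / nyquist * (n_bins - 1) for hz in points_hz]
    bands = []
    for i in range(n_bands):
        low = max(0, math.floor(points_bin[i]))
        high = min(n_bins - 1, math.ceil(points_bin[i + 2]))
        bands.append(list(range(low, high + 1)))
    return bands

bs_style = non_overlapping_bands([2, 2, 4, 8, 16, 32])  # hypothetical widths, 64 bins total
mel_style = mel_overlapping_bands(n_bins=64, n_bands=8, sample_rate=44100)

print([len(band) for band in bs_style])
print([(band[0], band[-1]) for band in mel_style])
```

In the mel-style output, low bands are narrow and high bands are wide, and each band's range reaches into its neighbour's, which is the "overlapped subbands" idea from the quote above.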
🧩 Where the models come from: these RoFormer checkpoints are often trained/distributed in ZFTurbo's MSST framework (MIT) and the MVSep ecosystem, and can be downloaded directly from audio-separator. The RoFormer model code is widely used via lucidrains' reimplementation.
Track record: the numbers of SDX23 and MUSDB18HQ
RoFormer's strength is backed by the numbers of competitions and benchmarks (primary sources only).
| Metric | Value | Condition | Source |
|---|---|---|---|
| SDX23 MSS standard division | 1st (team SAMI-ByteDance = BS-RoFormer) | Arbitrary data allowed | SDX23 paper |
| BS-RoFormer small version | 9.80 dB SDR | MUSDB18HQ, no extra data | arXiv:2309.02612 |
| Reference: Demucs v4 (htdemucs) | 9.00 dB (base) / 9.20 dB (FT) | MUSDB HQ test set, trained with extra data | Demucs |
⚠️ How to read the numbers: an SDR difference of a few tenths of a dB up to about 1 dB is, depending on the use, only audible once it is pointed out. RoFormer is indeed the highest quality, but whether that difference is worth the price in speed, VRAM, and cost depends on the use. How to objectively evaluate quality is covered in the quality-evaluation guide, and cross-model selection in the selection guide.
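For intuition about what a dB of SDR means, here is a minimal sketch of the basic signal-to-distortion ratio: the energy of the reference signal divided by the energy of the error, on a log scale. Published benchmark scores are computed with specific evaluation tooling and aggregation rules, so this function is an illustration of the quantity, not a way to reproduce the table above. The four-sample signals are made up.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """10 * log10(reference energy / error energy), in dB."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))

reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9 * x for x in reference]  # every sample is 10% too quiet

print(round(sdr_db(reference, estimate), 2))  # 20.0
```

On this scale, a 1 dB gain corresponds to roughly 21% less error energy (10^(-0.1) ≈ 0.79), which helps explain why sub-1 dB differences can be hard to hear on easy material.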
How to use: run RoFormer with audio-separator
The best news is that RoFormer can be called from the same Separator API as MDX-Net and Demucs. What's more, since audio-separator's default model is BS-RoFormer, RoFormer is used automatically if you don't specify a model.
pip install "audio-separator[gpu]"  # GPU build; use [cpu] for CPU-only
# roformer_quickstart.py: top-quality separation with BS-RoFormer
from audio_separator.separator import Separator
separator = Separator(output_dir="stems", output_format="flac")  # lossless, preserves quality
# BS-RoFormer is the default; pass the filename to be explicit
separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")
output_files = separator.separate("song.wav")
print(output_files)  # ['..._(Vocals)_....flac', '..._(Instrumental)_....flac']
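Downstream code usually needs to know which returned file is which stem. The helper below is a small sketch that reads the stem label from the `_(Label)_` part of each filename, following the naming pattern shown in the comment above. The helper name and the example filenames are hypothetical, and the pattern is an assumption about the naming scheme, so check it against what your installed version actually writes.

```python
import re

# Matches the "_(Vocals)_" style marker in an output filename.
STEM_PATTERN = re.compile(r"_\(([^)]+)\)_")

def stems_by_name(output_files):
    """Map a stem label (e.g. 'Vocals') to its output file path."""
    mapping = {}
    for path in output_files:
        labels = STEM_PATTERN.findall(path)
        if labels:
            # Use the last marker in case the song title itself contains one.
            mapping[labels[-1]] = path
    return mapping

# Hypothetical filenames in the shape of the quickstart output.
example_files = [
    "song_(Vocals)_model_bs_roformer.flac",
    "song_(Instrumental)_model_bs_roformer.flac",
]
stems = stems_by_name(example_files)
print(stems["Vocals"])
```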
Representative RoFormer models (always confirm what exists with audio-separator --list_models):
| Model filename | Lineage | Use |
|---|---|---|
| model_bs_roformer_ep_317_sdr_12.9755.ckpt | BS-RoFormer | audio-separator default. General-purpose, high quality |
| model_bs_roformer_ep_368_sdr_12.9628.ckpt | BS-RoFormer | A high-quality version of a different epoch |
| vocals_mel_band_roformer.ckpt | Mel-Band RoFormer | Vocal extraction |
| mel_band_roformer_kim_ft_unwa.ckpt | Mel-Band RoFormer | A vocal-specialized FT version |
# Check the exact RoFormer filenames available right now
audio-separator --list_models | grep -i roformer
🔧 Model names (especially the long ones with epoch/SDR) get updated in distribution. The names here were confirmed at writing time, but check what currently exists with --list_models. For RoFormer's fine-grained inference parameters, the rule of thumb is to first run with the official defaults as-is and adjust only the points you're dissatisfied with.
The price: the reality of VRAM and speed
RoFormer is not a fast, lightweight silver bullet. The following is community/practical knowledge (🟡 no unified official benchmark exists).
| Aspect | MDX-Net | Demucs v4 | BS/Mel-RoFormer |
|---|---|---|---|
| Quality | Mid–high | High | Highest |
| Speed | Fast | Mid | Slow |
| VRAM | Low | Mid | High |
Because it runs a large Transformer across band × time, it's the most computationally heavy, eats the most VRAM, and has the longest processing time — that's the price of quality. RoFormer is also the one most prone to hitting CUDA out of memory in long/bulk processing.
The basic OOM countermeasure is to lower segment_size (the size of the chunk loaded onto the GPU). Other sticking points, including the trap where processing falls back from GPU to CPU and becomes "works but agonizingly slow," are organized in the troubleshooting guide.
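One way to apply this is to retry with progressively smaller segments when the GPU runs out of memory. The sketch below assumes that RoFormer checkpoints are configured through the `mdxc_params` dictionary of `Separator`, that `override_model_segment_size` must be enabled for a custom `segment_size` to take effect, and that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory". All three are assumptions about the installed audio-separator and PyTorch versions, and the helper names are hypothetical; verify against your environment before relying on it.

```python
def segment_size_candidates(start=256, minimum=32):
    """Yield progressively smaller segment sizes: 256, 128, 64, 32."""
    size = start
    while size >= minimum:
        yield size
        size //= 2

def build_mdxc_params(segment_size):
    """Parameter dict for Separator(mdxc_params=...); key names are assumed."""
    return {
        "segment_size": segment_size,
        "override_model_segment_size": True,  # assumed necessary for the override to apply
        "batch_size": 1,
        "overlap": 8,
        "pitch_shift": 0,
    }

def separate_with_fallback(audio_path, model_filename):
    """Try separation, halving segment_size after each out-of-memory failure."""
    # Imported lazily so the helpers above work without the library installed.
    from audio_separator.separator import Separator

    for size in segment_size_candidates():
        try:
            separator = Separator(
                output_dir="stems",
                output_format="flac",
                mdxc_params=build_mdxc_params(size),
            )
            separator.load_model(model_filename=model_filename)
            return separator.separate(audio_path)
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise  # unrelated failure: do not hide it
    raise RuntimeError("Out of memory even at the smallest segment size")

print(list(segment_size_candidates()))
```

Smaller segments trade VRAM for processing time, so it is worth noting which size finally succeeded and starting from there on the next run.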
When to use RoFormer and when to stay with MDX-Net
"Just always use the highest quality" is, in production, wrong from the standpoint of cost and throughput. Use them differently by purpose.
When you should use RoFormer:
- Top-quality final delivery (the hero cut of an upload, clean separation of commercial audio).
- Difficult material (dense mixes, songs where vocals and instruments overlap in close bands).
- When you can tolerate the processing time, VRAM, and cost.
When you should stay with MDX-Net:
- When you want to carve down the unit price in bulk batches (speed and VRAM efficiency matter).
- When you only have a low-VRAM environment.
- When the quality requirement is "sufficient" and need not be the highest.
In practice, a two-tier setup works: draft everything quickly with MDX-Net to get a full overview, then run the final pass with RoFormer only on the cuts you adopt. This balances quality and cost. For cross-model decision-making, see the selection guide, and for production scale, the GPU-worker platform article.
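The two-tier idea can be expressed as a small planning function that decides which model runs on which cut. This is a sketch of the routing logic only: the model filenames are the ones mentioned in this article, while the `cuts` structure, the tier names, and the function name are hypothetical.

```python
# Tier-to-model mapping, using filenames that appear in this article.
MODEL_BY_TIER = {
    "draft": "UVR-MDX-NET-Inst_HQ_3.onnx",
    "final": "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
}

def plan_jobs(cuts):
    """Every cut gets a fast draft pass; only adopted cuts get the final pass."""
    jobs = [(cut["name"], MODEL_BY_TIER["draft"]) for cut in cuts]
    jobs += [(cut["name"], MODEL_BY_TIER["final"]) for cut in cuts if cut["adopted"]]
    return jobs

# Hypothetical input: two cuts, one of which was adopted after review.
cuts = [
    {"name": "intro", "adopted": False},
    {"name": "chorus", "adopted": True},
]
for name, model in plan_jobs(cuts):
    print(name, model)
```

Grouping the jobs by model, as this ordering does, also means each model only needs to be loaded once per batch.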
Frequently asked questions (FAQ)
Q. BS-RoFormer or Mel-Band RoFormer: which should I use?
A. Start with BS-RoFormer (audio-separator's default, with a rich track record). Since the paper reports that Mel-Band RoFormer surpasses it on vocals/drums/other, if you want to push vocal-extraction quality further, also try the Mel-Band version and decide by objectively evaluating on your own audio.
Q. Is RoFormer always better than MDX-Net or Demucs?
A. It is ahead in quality (SDR) but at a disadvantage in speed and VRAM. Whether a difference of a few tenths of a dB up to about 1 dB is worth the cost depends on the use. Use MDX-Net for bulk, RoFormer for top quality, and Demucs for 4 stems.
Q. How much GPU is needed?
A. Since RoFormer eats VRAM, the larger the GPU's VRAM, the more stable. If insufficient, lower segment_size. Handling CUDA out of memory is in the troubleshooting guide.
Q. Can it be used commercially?
A. audio-separator is MIT, but each model weight's distribution terms are individual, and above all the copyright of the audio you process (commercial songs, etc.) is a separate matter. For commercial use, always confirm with primary sources.
Q. I was using it without knowing the default model is RoFormer.
A. Yes, audio-separator uses BS-RoFormer (model_bs_roformer_ep_317_sdr_12.9755) when no model is specified. If you want it lighter, explicitly specify MDX-Net like load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx").
Conclusion: leverage the highest quality at "the right spot"
RoFormer's essence is in structuring frequency by band and modeling inner-band/inter-band simultaneously with a Transformer and RoPE. That's exactly why it's the current SOTA, and exactly why it's heavy.
- If you need quality, RoFormer: the strength of SDX23 1st place / MUSDB18HQ 9.80 dB.
- Implementation is easy: audio-separator's default model. Just specify the filename in load_model.
- Understand the price: the slowest, most VRAM-hungry. Handle OOM with segment_size.
- Using them differently is the right answer: MDX-Net for bulk, RoFormer for top quality, Demucs for 4 stems.
"Knowing the highest-quality model" and "using the optimal model differently by purpose, getting quality while keeping cost down in production" are different things. The latter is where outsourcing makes a difference. For building and improving voice/video AI including source separation, consult me along with my track record. With one person × generative AI, I support end-to-end from design to production operation.
Sources / official resources
- BS-RoFormer: arXiv:2309.02612 (Music Source Separation with Band-Split RoPE Transformer, ByteDance SAMI, ICASSP 2024)
- Mel-Band RoFormer: arXiv:2310.01809 (Mel-Band RoFormer for Music Source Separation)
- SDX23 (Sound Demixing Challenge 2023): AIcrowd / TISMIR paper
- Training/inference framework: ZFTurbo/Music-Source-Separation-Training (MIT)
- Execution library: nomadkaraoke/python-audio-separator (MIT)
- Model names, SDR, and licenses get updated. Always confirm primary sources before implementation (especially model names with audio-separator --list_models). The speed/VRAM trade-off is community knowledge; measuring on your own hardware is recommended.