友田 陽大
Audio source separation & preprocessing
BS-RoFormer
Mel-Band RoFormer
Source separation
Vocal extraction
AI audio
Python
Generative AI
MLOps

Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production

A guide, faithful to the official papers, to the current state of the art in source separation: BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. It covers what you need for production: why the quality is the highest (band-split × RoPE Transformer), the track record (1st place at SDX23, 9.80 dB on MUSDB18HQ), runnable code with audio-separator, the reality of VRAM and speed along with OOM countermeasures, and how to choose between RoFormer and MDX-Net.

Published
Reading time
9 min read
Author
友田 陽大

The goal of this article

Research source-separation models and you always arrive at RoFormer. Not Demucs, not UVR5/MDX-Net: the lineage said to have the highest separation quality as of 2026 is BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. In fact, BS-RoFormer is the default model of audio-separator, the de facto standard library for this task.

This piece explains this latest SOTA faithfully to the official papers (arXiv) and takes it to where you can use it in production starting today. When you finish reading, you'll be able to:

  1. Explain to others why RoFormer is the highest quality (band-split × RoPE Transformer).
  2. Run BS/Mel-Band RoFormer in audio-separator and confirm the quality on your own audio.
  3. Judge when you should use RoFormer and when to stay with MDX-Net, given the reality of VRAM and speed.

About the author (reliability disclosure): I have single-handedly designed, implemented, and run in production an AI video-localization platform whose first stage is source separation. I run a design that switches models per material ("MDX-Net for speed, RoFormer for top quality"), so the choices here come from actually weighing cost against quality. For choosing among tools overall, see the selection guide; Demucs is covered in its own dedicated guide.


30-second summary (conclusion first)

| Aspect | Conclusion |
| --- | --- |
| What model | Source separation's current SOTA. Band-splits the complex spectrogram and estimates the mask with Transformer + RoPE |
| BS-RoFormer | Band-Split RoPE Transformer (arXiv:2309.02612, ByteDance, ICASSP 2024) |
| Mel-Band RoFormer | Changes the band-split to overlapping Mel-scale bands (arXiv:2310.01809). Surpasses BS on some stems |
| Record | 1st in SDX23's standard MSS division; even the small version reaches 9.80 dB on MUSDB18HQ without extra data |
| Usage | audio-separator's default model. Just specify the filename in load_model |
| Price | The heaviest, slowest, most VRAM-hungry (community knowledge) |
| When to use | Top-quality final delivery, the hero cut of an upload, difficult material |
| When to stay | Bulk batches, low VRAM, speed/cost focus → MDX-Net |
| License | audio-separator is MIT. Confirm each model weight's distribution terms and the audio's copyright separately |

Why RoFormer is the highest quality (paper-based, made simple)

This chapter breaks down the core of RoFormer while keeping accuracy. If you only want the implementation, feel free to jump to usage.

The starting point: see the spectrogram by "band"

The frequency-domain approach to source separation converts audio with STFT into a complex spectrogram (time × frequency) and estimates each component's mask. The problem is that treating the whole frequency range uniformly looks at the thick low-frequency components and the delicate high-frequency components with the same weight.
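
As a minimal sketch of the masking idea, the toy example below multiplies a made-up complex "spectrogram" (frames × frequency bins) by a hand-written mask. The values and the helper `apply_mask` are illustrative assumptions; a real system obtains the spectrogram from an STFT and the mask from the model.

```python
# Toy illustration of mask-based separation in the frequency domain.
# The spectrogram and mask values are made up for illustration only.

def apply_mask(spectrogram, mask):
    """Multiply each time-frequency bin by its mask value (element-wise)."""
    return [
        [bin_value * mask_value for bin_value, mask_value in zip(frame, mask_frame)]
        for frame, mask_frame in zip(spectrogram, mask)
    ]

# 2 frames x 3 frequency bins of a "mixture" (complex STFT coefficients).
mixture = [
    [1 + 1j, 2 + 0j, 0 + 4j],
    [3 - 1j, 0 + 2j, 1 + 1j],
]

# A mask value near 1 keeps the bin for the target source, near 0 removes it.
vocal_mask = [
    [1.0, 0.5, 0.0],
    [1.0, 0.0, 0.5],
]

vocals = apply_mask(mixture, vocal_mask)

# Whatever the mask rejects is the residual (for example, the accompaniment).
residual = [
    [m - v for m, v in zip(m_frame, v_frame)]
    for m_frame, v_frame in zip(mixture, vocals)
]

print(vocals[0])
print(residual[0])
```

The same mask shape is applied to every bin here; the point of band-splitting is to let the model treat different frequency regions differently when it estimates those mask values.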

BS-RoFormer's answer: band-split × hierarchical Transformer × RoPE

The abstract of BS-RoFormer (arXiv:2309.02612, Lu, Wang, Kong, Hung, ByteDance SAMI) succinctly expresses the design.

BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE).

It decomposes into three elements.

  1. Band-split: projects the complex spectrogram into per-frequency-band (subband) representations. The granularity can be varied to match the structure of frequency.
  2. Hierarchical Transformer: stacked Transformers model both inner-band and inter-band sequences, capturing both "the temporal variation within a band" and "the relationship between bands."
  3. RoPE (Rotary Position Embedding): a mechanism that gives position information to the Transformer via rotation. It handles positional relationships stably even for long sequences and aids separation learning.

This "structuring frequency by band and modeling inner/inter-band simultaneously with a Transformer" is the source of quality surpassing the conventional U-Net family (MDX-Net and Demucs).
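
To make the third element concrete, here is a minimal sketch of rotary position embedding in plain Python. It assumes the commonly used formulation (rotating consecutive pairs of features with a base of 10000); whether BS-RoFormer uses exactly these frequencies is not stated in this article. The sketch shows the property that matters: the attention score between a rotated query and key depends only on their relative offset.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        # Each pair gets its own rotation speed; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 steps), very different absolute positions.
score_near_start = dot(rope(query, 5), rope(key, 2))
score_far_along = dot(rope(query, 105), rope(key, 102))

print(abs(score_near_start - score_far_along) < 1e-9)
```

Because the score is unchanged when both positions shift together, the model does not need to learn a separate representation for every absolute time step, which is one plausible reading of why RoPE helps on long audio sequences.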

Mel-Band RoFormer: bringing the band-cutting closer to "human hearing"

Mel-Band RoFormer (arXiv:2310.01809, Wang, Lu, Won) is a derivative that changes how BS-RoFormer splits the bands. In the paper's words:

we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale.

Whereas BS-RoFormer's band-split was a heuristic non-overlapping one, Mel-Band RoFormer splits into overlapping bands along the Mel scale (the Mel scale = a frequency scale close to human hearing). The Transformer+RoPE backend is shared, and only the band-split at the entrance differs. The paper reports it "surpasses BS-RoFormer on vocals / drums / other" (the abstract gives no specific dB value).
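
The difference between the two band layouts can be sketched as follows. This is not the paper's band construction: it assumes the common HTK mel formula and borrows the convention of a mel filterbank, where each band spans two mel steps and therefore overlaps its neighbours. The helper names and the choice of 8 bands are illustrative.

```python
import math

def hz_to_mel(hz):
    """HTK-style mel scale (an assumption; other mel variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_ranges(num_bands, max_hz):
    """Return (low_hz, high_hz) per band. Edges are equally spaced in mel and
    each band covers two mel steps, so adjacent bands overlap."""
    top = hz_to_mel(max_hz)
    edges = [mel_to_hz(top * i / (num_bands + 1)) for i in range(num_bands + 2)]
    return [(edges[k], edges[k + 2]) for k in range(num_bands)]

def bands_for_bin(bin_hz, ranges):
    """Indices of every band whose range contains this frequency bin."""
    return [k for k, (low, high) in enumerate(ranges) if low <= bin_hz < high]

ranges = mel_band_ranges(num_bands=8, max_hz=22050.0)
widths = [high - low for low, high in ranges]

for index, (low, high) in enumerate(ranges):
    print(f"band {index}: {low:8.1f} Hz to {high:8.1f} Hz")

# A mid-range bin falls into two bands at once because the bands overlap.
print(bands_for_bin(1000.0, ranges))
```

Two things are visible in the output: bands are narrow at low frequencies and wide at high frequencies, and a single frequency bin can contribute to more than one band, which is what "overlapped subbands" means in the quote above.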

🧩 Where the models come from: these RoFormer checkpoints are often trained/distributed in ZFTurbo's MSST framework (MIT) and the MVSep ecosystem, and can be downloaded directly from audio-separator. The RoFormer model code is widely used via lucidrains' reimplementation.


Track record: the numbers of SDX23 and MUSDB18HQ

RoFormer's strength is backed by the numbers of competitions and benchmarks (primary sources only).

| Metric | Value | Condition | Source |
| --- | --- | --- | --- |
| SDX23 MSS standard division | 1st (team SAMI-ByteDance = BS-RoFormer) | Arbitrary data allowed | SDX23 paper |
| BS-RoFormer small version | 9.80 dB SDR | MUSDB18HQ, no extra data | arXiv:2309.02612 |
| Reference: Demucs v4 (htdemucs) | 9.00 dB (base) / 9.20 dB (FT) | MUSDB HQ | Demucs |

⚠️ How to read the numbers: a 0.x–1 dB difference in SDR is, depending on the use, audible "once it's pointed out." RoFormer is indeed the highest quality, but whether that difference is worth the price of speed, VRAM, and cost depends on the use. How to objectively evaluate quality is in the quality-evaluation guide, and cross-model selection in the selection guide.
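
To get a feel for what 1 dB means, the sketch below uses the simple definition of SDR as signal energy over error energy, expressed in dB. This is an assumption made for illustration: benchmark tooling typically computes SDR per chunk and may use other variants, so the function is not a drop-in replacement for an evaluation toolkit. The signals are synthetic.

```python
import math

def sdr_db(reference, estimate):
    """10 * log10(signal energy / error energy) for two equal-length signals."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / error)

# Synthetic "ground truth" stem.
reference = [math.sin(0.1 * n) for n in range(1000)]

def with_error(scale):
    """Estimate = reference plus a deterministic error of the given amplitude."""
    return [r + scale * math.cos(0.37 * n) for n, r in enumerate(reference)]

baseline = sdr_db(reference, with_error(0.10))

# Shrinking the error amplitude by a factor of 10 ** (-1 / 20), about 11%,
# raises this SDR by 1 dB.
improved = sdr_db(reference, with_error(0.10 * 10 ** (-1 / 20)))

print(round(baseline, 2), round(improved, 2), round(improved - baseline, 6))
```

Under this definition, a 1 dB gain corresponds to an error signal roughly 11% smaller in amplitude, which is consistent with differences that are audible mainly once someone points them out.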


How to use: run RoFormer with audio-separator

The biggest good news is that RoFormer can be called from the same Separator API as MDX-Net and Demucs. What's more, since audio-separator's default model is BS-RoFormer, RoFormer is used automatically if you don't specify a model.

pip install "audio-separator[gpu]"   # GPU build; use [cpu] for CPU-only

# roformer_quickstart.py: highest-quality separation with BS-RoFormer
from audio_separator.separator import Separator

separator = Separator(output_dir="stems", output_format="flac")  # lossless, so quality is preserved

# BS-RoFormer is the default; pass the filename to select it explicitly
separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")

output_files = separator.separate("song.wav")
print(output_files)   # ['..._(Vocals)_....flac', '..._(Instrumental)_....flac']

Representative RoFormer models (always confirm what exists with audio-separator --list_models):

| Model filename | Lineage | Use |
| --- | --- | --- |
| model_bs_roformer_ep_317_sdr_12.9755.ckpt | BS-RoFormer | audio-separator default. General-purpose, high quality |
| model_bs_roformer_ep_368_sdr_12.9628.ckpt | BS-RoFormer | A high-quality version from a different epoch |
| vocals_mel_band_roformer.ckpt | Mel-Band RoFormer | Vocal extraction |
| mel_band_roformer_kim_ft_unwa.ckpt | Mel-Band RoFormer | A vocal-specialized fine-tuned (FT) version |

# Check the exact RoFormer filenames available right now
audio-separator --list_models --list_filter roformer

🔧 Model names (especially the long ones with epoch/SDR) get updated in distribution. The names here are confirmed at writing time, but confirm what exists with --list_models. For RoFormer's fine inference parameters, the iron rule is to first run with the official defaults as-is and adjust only the points you're dissatisfied with.
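
The long filenames listed above appear to embed the training epoch and an SDR score. As a small sketch, assuming that `_ep_<n>_sdr_<x>` pattern holds (it is inferred only from the names shown here, and `parse_checkpoint_name` is a hypothetical helper, not part of audio-separator), the values can be pulled out for logging or for recording which checkpoint produced a result:

```python
import re

# Pattern inferred from the filenames in this article; treat it as an assumption.
CHECKPOINT_PATTERN = re.compile(r"_ep_(\d+)_sdr_(\d+\.\d+)")

def parse_checkpoint_name(filename):
    """Return (epoch, sdr) if the filename embeds them, otherwise None."""
    match = CHECKPOINT_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

names = [
    "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
    "model_bs_roformer_ep_368_sdr_12.9628.ckpt",
    "vocals_mel_band_roformer.ckpt",
]

for name in names:
    print(name, "->", parse_checkpoint_name(name))
```

The score in a filename is whatever the distributor chose to record, so do not assume it is comparable with the MUSDB18HQ figures in the track-record table.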


The price: the reality of VRAM and speed

RoFormer is not a fast, lightweight silver bullet. The following is community/practical knowledge (🟡 no unified official benchmark exists).

| Aspect | MDX-Net | Demucs v4 | BS/Mel-RoFormer |
| --- | --- | --- | --- |
| Quality | Mid–high | High | Highest |
| Speed | Fast | Mid | Slow |
| VRAM | Low | Mid | High |

Because it runs a large Transformer across band × time, it's the most computationally heavy, eats the most VRAM, and has the longest processing time — that's the price of quality. RoFormer is also the one most prone to hitting CUDA out of memory in long/bulk processing.

The basic OOM countermeasure is to lower segment_size (the chunk loaded onto the GPU at once). The sticking points, including the trap where the GPU silently falls back to CPU and the job becomes "works but agonizingly slow," are organized in the troubleshooting guide.
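
One way to automate "lower segment_size until it fits" is a retry loop like the sketch below. It deliberately does not call audio-separator: `run_separation` is a hypothetical callable you would write yourself to build the separator with a given segment size and process the file, and the starting value of 256 and floor of 32 are placeholder assumptions. The only library behaviour relied on is that PyTorch reports CUDA OOM as a RuntimeError whose message contains "out of memory".

```python
def separate_with_backoff(run_separation, segment_size=256, min_segment_size=32):
    """Call run_separation(segment_size), halving the size after each OOM error.

    Returns (result, segment_size_that_worked). Errors that are not OOM, and
    OOM errors at the smallest allowed size, are re-raised unchanged.
    """
    while True:
        try:
            return run_separation(segment_size), segment_size
        except RuntimeError as error:
            if "out of memory" not in str(error).lower():
                raise  # Not an OOM problem; retrying would hide a real bug.
            if segment_size // 2 < min_segment_size:
                raise  # Already at the floor; give up and surface the OOM.
            segment_size //= 2

def fake_separation(segment_size):
    """Stand-in for a real GPU job: pretend anything above 64 does not fit."""
    if segment_size > 64:
        raise RuntimeError("CUDA out of memory (simulated)")
    return ["vocals.flac", "instrumental.flac"]

files, used_size = separate_with_backoff(fake_separation)
print(files, used_size)
```

In a real job it is usually worth logging the size that finally worked, so that the next run on the same hardware can start from it instead of rediscovering it through failures.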


When to use RoFormer and when to stay with MDX-Net

"Just always use the highest quality" is, in production, wrong from the standpoint of cost and throughput. Use them differently by purpose.

When you should use RoFormer:

  • Top-quality final delivery (the hero cut of an upload, clean separation of commercial audio).
  • Difficult material (dense mixes, songs where vocals and instruments overlap in close bands).
  • When you can tolerate the processing time, VRAM, and cost.

When you should stay with MDX-Net:

  • When you want to carve down the unit price in bulk batches (speed and VRAM efficiency matter).
  • When you only have a low-VRAM environment.
  • When the quality requirement is "sufficient" and need not be the highest.

In practice, a two-tier setup works — draft fast with MDX-Net for a full overview → final-generate only the adopted cuts with RoFormer. This balances quality and cost. For cross-model decision-making, see the selection guide, and for production scale, the GPU-worker platform article.
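
The two-tier setup can be sketched as a small job planner. The model filenames are the ones named in this article; `plan_two_tier` and the example cut names are hypothetical, and the sketch only decides which model each file should get, leaving the actual separation call to your own worker code.

```python
FAST_MODEL = "UVR-MDX-NET-Inst_HQ_3.onnx"                  # draft pass (MDX-Net)
BEST_MODEL = "model_bs_roformer_ep_317_sdr_12.9755.ckpt"   # final pass (BS-RoFormer)

def plan_two_tier(all_cuts, adopted_cuts):
    """Return (cut, model_filename) jobs: a fast draft pass over every cut,
    followed by a top-quality pass over the adopted cuts only."""
    adopted = set(adopted_cuts)
    draft_jobs = [(cut, FAST_MODEL) for cut in all_cuts]
    final_jobs = [(cut, BEST_MODEL) for cut in all_cuts if cut in adopted]
    return draft_jobs + final_jobs

jobs = plan_two_tier(
    all_cuts=["intro.wav", "verse.wav", "chorus.wav"],
    adopted_cuts=["chorus.wav"],
)

for cut, model in jobs:
    print(f"{cut:12s} -> {model}")
```

With three cuts and one adopted, the expensive model runs once instead of three times, which is where the cost saving of this setup comes from.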


Frequently asked questions (FAQ)

Q. BS-RoFormer or Mel-Band RoFormer — which to use? A. First BS-RoFormer (audio-separator's default, with a rich record). Since the paper reports Mel-Band RoFormer surpasses on vocals/drums/other, if you want to dial in quality further for vocal extraction, also try the Mel-Band version and decide by objectively evaluating on your own audio.

Q. Is RoFormer always better than MDX-Net or Demucs? A. In quality (SDR) it's above, but at a disadvantage in speed and VRAM. Whether a 0.x–1 dB difference is worth the cost depends on the use. Use MDX-Net for bulk, RoFormer for top quality, and Demucs for 4 stems.

Q. How much GPU is needed? A. Since RoFormer eats VRAM, the larger the GPU's VRAM, the more stable. If insufficient, lower segment_size. Handling CUDA out of memory is in the troubleshooting guide.

Q. Can it be used commercially? A. audio-separator is MIT, but each model weight's distribution terms are individual, and above all the copyright of the audio you process (commercial songs, etc.) is a separate matter. For commercial use, always confirm with primary sources.

Q. I was using it without knowing the default model is RoFormer. A. Yes, audio-separator uses BS-RoFormer (model_bs_roformer_ep_317_sdr_12.9755) when no model is specified. If you want it lighter, explicitly specify MDX-Net like load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx").


Conclusion: leverage the highest quality at "the right spot"

RoFormer's essence is in structuring frequency by band and modeling inner-band/inter-band simultaneously with a Transformer and RoPE. That's exactly why it's the current SOTA, and exactly why it's heavy.

  1. If you need quality, RoFormer: the strength of SDX23 1st place / MUSDB18HQ 9.80 dB.
  2. Implementation is easy: audio-separator's default model. Just specify the filename in load_model.
  3. Understand the price: the slowest, most VRAM-hungry. Handle OOM with segment_size.
  4. Using them differently is the right answer: MDX-Net for bulk, RoFormer for top quality, Demucs for 4 stems.

"Knowing the highest-quality model" and "using the optimal model differently by purpose, getting quality while keeping cost down in production" are different things. The latter is where outsourcing makes a difference. For building and improving voice/video AI including source separation, consult me along with my track record. With one person × generative AI, I support end-to-end from design to production operation.


Sources / official resources

  • Model names, SDR, and licenses get updated. Always confirm primary sources before implementation (especially model names with audio-separator --list_models). The speed/VRAM trade-off is community knowledge; measuring on your own hardware is recommended.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
