Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production

The goal of this article

Research source-separation models for long enough and you arrive at RoFormer. Not Demucs, not UVR5/MDX-Net, but the lineage widely regarded as delivering the highest separation quality as of 2026: BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. In fact, BS-RoFormer is the default model of audio-separator, the de facto standard library.

This article explains the current SOTA while staying faithful to the official papers (arXiv), and takes you to the point where you can use it in production starting today. When you finish reading, you'll be able to:

Explain to others why RoFormer is the highest quality (band-split × RoPE Transformer).
Run BS/Mel-Band RoFormer in audio-separator and confirm the quality on your own audio.
Judge when you should use RoFormer and when to stay with MDX-Net, given the reality of VRAM and speed.

About the author (reliability disclosure): I single-handedly designed, implemented, and run in production an AI video-localization platform whose first stage is source separation. It switches models per material ("MDX-Net for speed, RoFormer for top quality"), so the choices here come from actually weighing cost against quality. For choosing tools overall, see the selection guide; Demucs is covered in its dedicated guide.

30-second summary (conclusion first)

Aspect	Conclusion
What model	Source separation's current SOTA. Band-splits the complex spectrogram and estimates the mask with Transformer + RoPE
BS-RoFormer	Band-Split RoPE Transformer (arXiv:2309.02612, ByteDance, ICASSP2024)
Mel-Band RoFormer	Replaces the band split with overlapping Mel-scale bands (arXiv:2310.01809). Reported to surpass BS-RoFormer on vocals, drums, and other
Record	1st in SDX23's standard MSS division; even the small version reaches 9.80 dB SDR on MUSDB18HQ without extra training data
Usage	audio-separator's default model. Just specify the filename in `load_model`
Price	The heaviest, slowest, and most VRAM-hungry of the options compared here (community knowledge)
When to use	Top-quality final delivery, the hero cut of an upload, difficult material
When to stay	Bulk batches, low VRAM, speed/cost focus → MDX-Net
License	audio-separator is MIT. Confirm each model weight's distribution terms and the audio's copyright separately

Why RoFormer is the highest quality (paper-based, made simple)

This chapter breaks down the core of RoFormer while keeping accuracy. If you only want the implementation, feel free to jump to usage.

The starting point: see the spectrogram by "band"

The frequency-domain approach to source separation converts audio into a complex spectrogram (time × frequency) with the STFT and estimates a mask for each source. The problem is that treating the whole frequency range uniformly gives the energy-dense low frequencies and the sparse, delicate high frequencies the same weight.
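As a minimal sketch of what "estimating a mask" means, the toy example below multiplies a made-up complex spectrogram by a made-up mask. All values and the `apply_mask` helper are hypothetical, and a real-valued mask is assumed for simplicity; real models predict the mask with a network and may use complex-valued masks.

```python
# Toy complex spectrogram: 2 time frames x 3 frequency bins (made-up values).
mixture = [
    [1.0 + 1.0j, 0.5 - 0.5j, 0.2 + 0.0j],
    [0.8 + 0.2j, 0.4 + 0.4j, 0.1 - 0.1j],
]

# Hypothetical mask for one source: keep the low bin, halve the middle, drop the top.
mask = [
    [1.0, 0.5, 0.0],
    [1.0, 0.5, 0.0],
]


def apply_mask(spectrogram, mask):
    """Multiply every time-frequency bin by its mask value."""
    return [
        [m * x for x, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spectrogram, mask)
    ]


source = apply_mask(mixture, mask)

# Whatever the mask removes is attributed to the remaining sources.
residual = [
    [x - s for x, s in zip(mix_row, src_row)]
    for mix_row, src_row in zip(mixture, source)
]

print(source[0])
print(residual[0])
```

Note that the same mask shape is applied to every bin here; the point of the band-based designs discussed next is that the model does not have to treat all frequency regions identically.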

BS-RoFormer's answer: band-split × hierarchical Transformer × RoPE

The abstract of BS-RoFormer (arXiv:2309.02612, Lu, Wang, Kong, Hung, ByteDance SAMI) succinctly expresses the design.

BS-RoFormer relies on a band-split module to project the input complex spectrogram into subband-level representations, and then arranges a stack of hierarchical Transformers to model the inner-band as well as inter-band sequences for multi-band mask estimation. To facilitate training the model for MSS, we propose to use the Rotary Position Embedding (RoPE).

It decomposes into three elements.

Band-split: projects the complex spectrogram into per-frequency-band (subband) representations. The granularity can be varied to match the structure of frequency.
Hierarchical Transformer: stacked Transformers that model both inner-band and inter-band sequences. They capture both "the temporal variation within a band" and "the relationship between bands."
RoPE (Rotary Position Embedding): a mechanism that gives position information to the Transformer via rotation. It handles positional relationships stably even for long sequences and aids separation learning.
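To make the third element concrete, here is a minimal pure-Python sketch of RoPE under the usual formulation (rotate consecutive pairs of dimensions by an angle proportional to the position). The function names, the `base` value of 10000, and the vectors are illustrative assumptions, not code from the paper. The property it demonstrates is that the attention score between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)  # the two scores agree up to floating-point error
```

Because rotation preserves vector length, position is encoded without changing the magnitude of the features, only their relative angle.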

This "structuring frequency by band and modeling inner/inter-band simultaneously with a Transformer" is the source of quality surpassing the conventional U-Net family (MDX-Net and Demucs).
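The band structure and the two kinds of sequences can be sketched as follows. This is a toy illustration under assumptions: the band widths, the `split_bins` helper, and the string placeholders standing in for band embeddings are all hypothetical, and the reading of "inner-band" as a sequence over time within one band and "inter-band" as a sequence over bands within one frame is my interpretation of the abstract.

```python
def split_bins(num_bins, band_widths):
    """Return non-overlapping (start, stop) bin ranges that cover all bins."""
    if sum(band_widths) != num_bins:
        raise ValueError("band widths must add up to the number of bins")
    ranges = []
    start = 0
    for width in band_widths:
        ranges.append((start, start + width))
        start += width
    return ranges


# Hypothetical layout: 16 bins, narrow bands at low frequencies, wide at high.
bands = split_bins(16, [1, 1, 2, 4, 8])

# grid[t][b] stands in for the embedding of band b at time frame t.
num_frames = 3
grid = [[f"t{t}b{b}" for b in range(len(bands))] for t in range(num_frames)]

# Inner-band view: one sequence over time for each band.
inner_band_sequences = [
    [grid[t][b] for t in range(num_frames)] for b in range(len(bands))
]

# Inter-band view: one sequence over bands for each time frame.
inter_band_sequences = [list(row) for row in grid]

print(bands)
print(inner_band_sequences[0])
print(inter_band_sequences[0])
```

A hierarchical stack would alternate attention over these two views, so each band embedding is informed both by its own history and by the other bands in the same frame.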

Mel-Band RoFormer: bringing the band-cutting closer to "human hearing"

Mel-Band RoFormer (arXiv:2310.01809, Wang, Lu, Won) is a derivative that changes how BS-RoFormer splits the bands. In the paper's words:

we propose Mel-RoFormer, which adopts the Mel-band scheme that maps the frequency bins into overlapped subbands according to the mel scale.

Whereas BS-RoFormer's band-split was a heuristic non-overlapping one, Mel-Band RoFormer splits into overlapping bands along the Mel scale (the Mel scale = a frequency scale close to human hearing). The Transformer+RoPE backend is shared, and only the band-split at the entrance differs. The paper reports it "surpasses BS-RoFormer on vocals / drums / other" (the abstract gives no specific dB value).
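A minimal sketch of overlapping Mel-scale bands, assuming the common HTK-style Mel formula and a triangular-filterbank-style layout in which band i spans Mel points i to i+2. The function names, the bin count (1025), the sample rate, and the number of bands are illustrative assumptions; the paper's exact mapping may differ.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_bins(num_bins, sample_rate, num_bands):
    """Map STFT bin indices to overlapping bands spaced evenly on the Mel scale."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # num_bands + 2 edge points, evenly spaced in Mel, converted back to Hz.
    edges_hz = [
        mel_to_hz(max_mel * i / (num_bands + 1)) for i in range(num_bands + 2)
    ]
    edges_hz[-1] = nyquist  # guard against floating-point round-off
    bin_hz = nyquist / (num_bins - 1)
    # Band i covers edges i..i+2, so it shares bins with bands i-1 and i+1.
    return [
        [k for k in range(num_bins) if edges_hz[i] <= k * bin_hz <= edges_hz[i + 2]]
        for i in range(num_bands)
    ]


bands = mel_band_bins(num_bins=1025, sample_rate=44100, num_bands=8)
for index, bins in enumerate(bands):
    print(index, bins[0], bins[-1], len(bins))
```

The printout shows narrow bands at low frequencies and wide bands at high frequencies, with each band sharing bins with its neighbours, in contrast to a non-overlapping split.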

🧩 Where the models come from: these RoFormer checkpoints are often trained and distributed through ZFTurbo's MSST framework (MIT) and the MVSep ecosystem, and audio-separator can download them directly. The RoFormer model code is widely used via lucidrains' reimplementation.

Track record: the numbers of SDX23 and MUSDB18HQ

RoFormer's strength is backed by the numbers of competitions and benchmarks (primary sources only).

Metric	Value	Condition	Source
SDX23 MSS standard division	1st (team SAMI-ByteDance = BS-RoFormer)	Arbitrary data allowed	SDX23 paper
BS-RoFormer small version	9.80 dB SDR	MUSDB18HQ, no extra data	arXiv:2309.02612
Reference: Demucs v4 (htdemucs)	9.00 dB (base) / 9.20 dB (FT)	MUSDB HQ (per the Demucs README, trained with additional songs beyond MUSDB)	Demucs

⚠️ How to read the numbers: a 0.x–1 dB difference in SDR is, depending on the use, audible "once it's pointed out." RoFormer is indeed the highest quality, but whether that difference is worth the price of speed, VRAM, and cost depends on the use. How to objectively evaluate quality is in the quality-evaluation guide, and cross-model selection in the selection guide.
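To get a feel for what a dB of SDR means, the sketch below uses the basic definition, 10·log10(signal energy / error energy), on made-up samples. This is a simplified illustration: the `sdr_db` and `error_energy_ratio` helpers are hypothetical, and published benchmark figures follow specific evaluation protocols that this does not reproduce.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB (basic energy-ratio definition)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0:
        return math.inf
    return 10.0 * math.log10(signal / error)


def error_energy_ratio(delta_db):
    """Factor by which the error energy shrinks for an SDR gain of delta_db."""
    return 10.0 ** (-delta_db / 10.0)


# Made-up reference stem and an estimate that is 10% too quiet everywhere.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(sdr_db(reference, estimate))  # roughly 20 dB

# An 0.8 dB gain (e.g. 9.00 -> 9.80) leaves about 83% of the error energy.
print(error_energy_ratio(0.8))
```

In other words, a sub-1 dB gap corresponds to a modest reduction in residual error, which is consistent with differences that are audible mainly once they are pointed out.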

How to use: run RoFormer with audio-separator

The biggest good news is that RoFormer can be called from the same Separator API as MDX-Net and Demucs. What's more, since audio-separator's default model is BS-RoFormer, RoFormer is used automatically if you don't specify a model.

pip install "audio-separator[gpu]"   # GPU build; use [cpu] for CPU-only

# roformer_quickstart.py: top-quality separation with BS-RoFormer
from audio_separator.separator import Separator

separator = Separator(output_dir="stems", output_format="flac")  # lossless, preserves quality

# BS-RoFormer is the default; pass the filename to be explicit
separator.load_model(model_filename="model_bs_roformer_ep_317_sdr_12.9755.ckpt")

output_files = separator.separate("song.wav")
print(output_files)   # ['..._(Vocals)_....flac', '..._(Instrumental)_....flac']

Representative RoFormer models (always confirm what exists with audio-separator --list_models):

Model filename	Lineage	Use
`model_bs_roformer_ep_317_sdr_12.9755.ckpt`	BS-RoFormer	audio-separator default. General-purpose, high quality
`model_bs_roformer_ep_368_sdr_12.9628.ckpt`	BS-RoFormer	A high-quality version of a different epoch
`vocals_mel_band_roformer.ckpt`	Mel-Band RoFormer	Vocal extraction
`mel_band_roformer_kim_ft_unwa.ckpt`	Mel-Band RoFormer	A vocal-specialized FT version

# Check the exact RoFormer filenames available right now
audio-separator --list_models --list_filter roformer

🔧 Model names (especially the long ones with epoch/SDR) get updated in distribution. The names here are confirmed at writing time, but confirm what exists with --list_models. For RoFormer's fine inference parameters, the iron rule is to first run with the official defaults as-is and adjust only the points you're dissatisfied with.

The price: the reality of VRAM and speed

RoFormer is not a fast, lightweight silver bullet. The following is community/practical knowledge (🟡 no unified official benchmark exists).

Aspect	MDX-Net	Demucs v4	BS/Mel-RoFormer
Quality	Mid–high	High	Highest
Speed	Fast	Mid	Slow
VRAM	Low	Mid	High

Because it runs a large Transformer across band × time, it is the most computationally heavy, uses the most VRAM, and takes the longest to process. That is the price of quality. RoFormer is also the most prone to CUDA out-of-memory errors in long or bulk processing.

The basic OOM countermeasure is lowering segment_size (the size of the chunk loaded onto the GPU at once). The sticking points, including the trap where processing falls back from GPU to CPU and "works but is agonizingly slow," are organized in the troubleshooting guide.
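One way to apply this is to retry with a progressively smaller segment size. The sketch below is an assumption-laden illustration, not the library's prescribed method: it assumes your installed audio-separator accepts the `mdxc_params` dictionary shown in its README and that RoFormer models read their segment size from it, that `override_model_segment_size=True` is what makes a custom value take effect, and that an out-of-memory failure surfaces as a `RuntimeError` whose message contains "out of memory". The helper names and the 256/64 bounds are hypothetical; verify all of this against your version.

```python
def segment_size_schedule(start=256, floor=64):
    """Segment sizes to try, halving each time: e.g. 256, 128, 64."""
    sizes = []
    size = start
    while size >= floor:
        sizes.append(size)
        size //= 2
    return sizes


def separator_kwargs(segment_size):
    """Constructor arguments for Separator (parameter names assumed from the README)."""
    return {
        "output_dir": "stems",
        "output_format": "flac",
        "mdxc_params": {
            "segment_size": segment_size,
            "override_model_segment_size": True,  # assumption: needed for the override
            "batch_size": 1,
            "overlap": 8,
            "pitch_shift": 0,
        },
    }


def separate_with_fallback(audio_path, model_filename):
    """Retry separation with smaller segments when the GPU runs out of memory."""
    from audio_separator.separator import Separator  # imported lazily

    for size in segment_size_schedule():
        try:
            separator = Separator(**separator_kwargs(size))
            separator.load_model(model_filename=model_filename)
            return separator.separate(audio_path)
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise  # unrelated failure: do not hide it
    raise RuntimeError("still out of memory at the smallest segment size")


print(segment_size_schedule())
```

Smaller segments trade throughput for memory, so it is worth logging which size finally succeeded and starting from that value on the same hardware.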

When to use RoFormer and when to stay with MDX-Net

"Just always use the highest quality" is, in production, wrong from the standpoint of cost and throughput. Use them differently by purpose.

When you should use RoFormer:

Top-quality final delivery (the hero cut of an upload, clean separation of commercial audio).
Difficult material (dense mixes, songs where vocals and instruments overlap in close bands).
When you can tolerate the processing time, VRAM, and cost.

When you should stay with MDX-Net:

When you want to drive down the per-item cost in bulk batches (speed and VRAM efficiency matter).
When you only have a low-VRAM environment.
When the quality requirement is "sufficient" and need not be the highest.

In practice, a two-tier setup works: draft quickly with MDX-Net to get a full overview, then run the final pass with RoFormer only on the cuts you adopt. This balances quality and cost. For cross-model decision-making, see the selection guide, and for production scale, the GPU-worker platform article.
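The two-tier flow can be sketched as a simple job plan. This is a hypothetical illustration: the `draft_jobs` and `final_jobs` helpers and the file names are made up, and the model filenames are the ones quoted in this article, which should be confirmed with --list_models before use.

```python
DRAFT_MODEL = "UVR-MDX-NET-Inst_HQ_3.onnx"  # fast tier (MDX-Net)
FINAL_MODEL = "model_bs_roformer_ep_317_sdr_12.9755.ckpt"  # quality tier (BS-RoFormer)


def draft_jobs(cuts):
    """Tier 1: every cut gets a fast draft separation."""
    return [(cut, DRAFT_MODEL) for cut in cuts]


def final_jobs(cuts, adopted):
    """Tier 2: only adopted cuts get the expensive final pass, in original order."""
    adopted_set = set(adopted)
    unknown = adopted_set - set(cuts)
    if unknown:
        raise ValueError(f"adopted cuts not in the batch: {sorted(unknown)}")
    return [(cut, FINAL_MODEL) for cut in cuts if cut in adopted_set]


cuts = ["cut_01.wav", "cut_02.wav", "cut_03.wav"]
print(draft_jobs(cuts))
print(final_jobs(cuts, adopted=["cut_02.wav"]))
```

Each (path, model) pair would then be handed to a worker that calls load_model and separate, so the heavy model only runs on the material that actually ships.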

Frequently asked questions (FAQ)

Q. BS-RoFormer or Mel-Band RoFormer: which should I use? A. Start with BS-RoFormer (audio-separator's default, with a rich track record). Since the paper reports that Mel-Band RoFormer surpasses BS-RoFormer on vocals/drums/other, if you want to push vocal-extraction quality further, also try the Mel-Band version and decide by objectively evaluating on your own audio.

Q. Is RoFormer always better than MDX-Net or Demucs? A. In quality (SDR) it's above, but at a disadvantage in speed and VRAM. Whether a 0.x–1 dB difference is worth the cost depends on the use. Use MDX-Net for bulk, RoFormer for top quality, and Demucs for 4 stems.

Q. How much GPU is needed? A. Since RoFormer eats VRAM, the larger the GPU's VRAM, the more stable. If insufficient, lower segment_size. Handling CUDA out of memory is in the troubleshooting guide.

Q. Can it be used commercially? A. audio-separator is MIT, but each model weight's distribution terms are individual, and above all the copyright of the audio you process (commercial songs, etc.) is a separate matter. For commercial use, always confirm with primary sources.

Q. I was using it without knowing the default model is RoFormer. A. Yes, audio-separator uses BS-RoFormer (model_bs_roformer_ep_317_sdr_12.9755) when no model is specified. If you want it lighter, explicitly specify MDX-Net like load_model(model_filename="UVR-MDX-NET-Inst_HQ_3.onnx").

Conclusion: leverage the highest quality at "the right spot"

RoFormer's essence is in structuring frequency by band and modeling inner-band/inter-band simultaneously with a Transformer and RoPE. That's exactly why it's the current SOTA, and exactly why it's heavy.

If you need quality, RoFormer: the strength of SDX23 1st place / MUSDB18HQ 9.80 dB.
Implementation is easy: audio-separator's default model. Just specify the filename in load_model.
Understand the price: the slowest, most VRAM-hungry. Handle OOM with segment_size.
Using them differently is the right answer: MDX-Net for bulk, RoFormer for top quality, Demucs for 4 stems.

"Knowing the highest-quality model" and "using the optimal model differently by purpose, getting quality while keeping cost down in production" are different things. The latter is where outsourcing makes a difference. For building and improving voice/video AI including source separation, consult me along with my track record. With one person × generative AI, I support end-to-end from design to production operation.

Sources / official resources

BS-RoFormer: arXiv:2309.02612 (Music Source Separation with Band-Split RoPE Transformer, ByteDance SAMI, ICASSP 2024)
Mel-Band RoFormer: arXiv:2310.01809 (Mel-Band RoFormer for Music Source Separation)
SDX23 (Sound Demixing Challenge 2023): AIcrowd / TISMIR paper
Training/inference framework: ZFTurbo/Music-Source-Separation-Training (MIT)
Execution library: nomadkaraoke/python-audio-separator (MIT)

Model names, SDR, and licenses get updated. Always confirm primary sources before implementation (especially model names with audio-separator --list_models). The speed/VRAM trade-off is community knowledge; measuring on your own hardware is recommended.

Related articles

How to choose a source-separation tool: selecting Demucs / UVR5(MDX-Net) / Spleeter / Open-Unmix by requirements

Scaling audio source separation in production on AWS: a GPU batch-processing platform (SQS × ECS/Batch × S3)

Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs

Turning source separation into a production API: the design of GPU worker × job queue × idempotency
