友田 陽大

Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs

This guide explains Meta's source-separation model Demucs v4 (HT Demucs), staying faithful to the official documentation (GitHub repository and paper). It covers how the waveform × spectrogram × Transformer architecture works, how to choose among the htdemucs-family models, the CLI and the Python API, practical recipes for vocal separation, karaoke, ASR preprocessing, and video localization, and the production design concerns (out-of-memory on long audio, idempotency, and resilience), all with concrete code.

Author
友田 陽大

The Goal of This Article

Demucs is a music source separation (MSS) model published by Meta (formerly Facebook AI Research) that decomposes a track or other piece of audio into four stems: vocals, drums, bass, and other. Its fourth generation, Demucs v4, also called HT Demucs (Hybrid Transformer Demucs), was proposed in the paper Hybrid Transformers for Music Source Separation (ICASSP 2023) and achieves SOTA-class separation quality among publicly available models.

This article stays strictly within the official documentation (the GitHub README, docs/api.md, and the paper) and fills in, with working code, what the official README does not cover: which situations Demucs suits, how to use it, and where it gets stuck. By the end, you should be able to do the following three things.

  1. Explain to others what kind of model Demucs v4 is and why its quality is high.
  2. Choose between the CLI (for trying it out) and the Python API (for embedding it), and start using either today.
  3. Build a resilient implementation that survives production rather than a demo: memory exhaustion on long audio, the ffmpeg dependency, and duplicate processing.

About the author (disclosed for credibility): I single-handedly designed, implemented, and run in production an AI video-localization platform that takes an uploaded video and fully automates audio separation → transcription → translation → multilingual dubbing → lip-sync. Its first stage, audio separation, splits the human voice from the BGM and sound effects in the source video so that narration in another language can be swapped in while the BGM is kept, and that stage runs on the Demucs described here. The pitfalls and resilience design in this article are therefore not demo knowledge but a record of the mines I stepped on in real operation. The design of the whole pipeline is covered in a separate article, and a project overview is available at the portfolio link at the end of this article.


A 30-Second Summary (Conclusion First)

| Viewpoint | Conclusion |
| --- | --- |
| What kind of model | A source-separation model that decomposes a single piece of audio into 4 stems: vocals / drums / bass / other |
| What's impressive | Processes the waveform and the spectrogram in parallel and bridges them with a Transformer. SOTA-class among public models with 9.0 dB SDR on MUSDB HQ (9.20 dB with the fine-tuned htdemucs_ft) |
| Generation | v4 = HT Demucs. The paper is ICASSP 2023 (Rouard, Massa, Défossez). The latest stable generation |
| Ease of installation | One line: pip install -U demucs. Runs on CPU too (processing takes about 1.5× the length of the audio). GPU from 3GB VRAM |
| License | MIT. Commercial use OK (but the master rights / copyright of the separated audio are a separate matter) |
| Just to try | One CLI command: demucs song.mp3. Output is separated/htdemucs/song/{vocals,drums,bass,other}.wav |
| To build in | Python API: receive tensors from demucs.api.Separator and embed them in your own pipeline |
| Main quality knobs | shifts (averages predictions for up to +0.2 dB, N× slower) / overlap (default 0.25) / segment (memory and quality) |
| Suited uses | Karaoke / vocal removal, ASR preprocessing (improving Whisper accuracy), keeping BGM in video localization, ear-copying / education, remixing |
| Unsuited uses | Strict zero-latency live streaming (Demucs is basically offline processing) |

If you want to check the quality on your own audio first, jump straight to Usage A: First, Try It (CLI One-Liner). Including pip install, you will have a result in a few minutes.


What Kind of Model Demucs v4 Is

The input is a single audio file (a music track, narration, a recorded stream, and so on). The output is 4 stems (vocals, drums, bass, and other), each written out as a 44.1kHz stereo wav. Choosing htdemucs_6s adds guitar and piano for 6 stems.

This is useful in situations like the following.

  • Karaoke / vocal removal: --two-stems=vocals splits the audio in two, vocals and no_vocals (accompaniment). Off-vocal tracks for sing-along covers and practice karaoke can be made with one command.
  • ASR (transcription) preprocessing: for audio with BGM or noise, extracting only the human voice before passing it to Whisper raises transcription accuracy (see the recipe later). A standard technique to combine with the Whisper article.
  • Video localization (dubbing): split the source video's audio into narration and BGM / sound effects, then swap in narration in another language while keeping the BGM. The result is far more immersive than subtitles or a flat audio replacement, which helps conversion rates when expanding overseas. The first stage of my platform does exactly this.
  • Ear-copying / music education: extract only the bass or only the drums for copying, transcription, and rhythm practice.
  • Remix / sampling / DJ: extract stems and reconstruct them, or pull out a cappella and instrumental material.

Conversely, it is unsuited to strict real-time streaming (uses that need a response within tens of milliseconds). Demucs is basically offline processing that handles a whole file at a time, and on CPU it takes roughly 1.5× the length of the source (a GPU is far faster). For live use, consider a separate lightweight model.


The Mechanism: Why "Waveform × Spectrogram × Transformer" (Made Gentle, Faithful to the Paper)

This chapter breaks down the core of the official paper while keeping it accurate. If you only want the implementation, jump to How to Choose a Model. That said, understanding this chapter explains why the pitfalls take the shape they do.

Source Separation Has "Two Ways of Seeing"

When you hand sound to a machine, there are broadly 2 ways of representing it.

  • Waveform (temporal): the time axis itself. Strong on temporal details like the sharpness of an attack and phase.
  • Spectrogram (spectral / STFT): an image of frequency × time. Closer to how a human "tells instruments apart" — what is sounding in which frequency band.

Earlier source-separation models leaned toward one representation or the other. Demucs v3 (Hybrid Demucs) unified both in one network, and v4 layered a Transformer on top of that.

The Structure of v4: Hybrid bi-U-Net + cross-domain Transformer

Following the definition in the paper (arXiv:2211.08553), HT Demucs has this structure.

  • It runs 2 parallel encoders. One processes the waveform, the other the spectrogram (a temporal/spectral bi-U-Net).
  • It replaces the innermost layers with a cross-domain Transformer Encoder.
  • The Transformer integrates information with self-attention within each domain (waveform with waveform) and cross-attention across domains (waveform ↔ spectrogram).

Intuitively, it has an ear that listens in time and an ear that listens in frequency at once, and it uses the Transformer's attention to cross-check long-range context from both when deciding which sound belongs to which stem. The question the paper posed was whether long-range context is useful for source separation or whether local acoustic features are enough, and the answer was that it is useful.

Quality: 9.0–9.20 dB SDR on MUSDB HQ

Separation quality is measured by SDR (Signal-to-Distortion Ratio; higher is better). The official reports are as follows.

| Model | SDR (MUSDB HQ) | Notes |
| --- | --- | --- |
| Hybrid Demucs (v3) | about 7.7 dB | No extra data |
| HT Demucs (v4, htdemucs) | 9.0 dB | Trained on MusDB + 800 songs |
| HT Demucs f.t. (v4, htdemucs_ft) | 9.20 dB | Sparse attention + per-source fine-tuning. SOTA-class among public models |

The paper reports that, when trained with 800 additional songs, HT Demucs surpassed Hybrid Demucs by 0.45 dB, and further reached 9.20 dB by adding sparse attention (expanding the receptive field with a sparse attention mechanism) and per-source fine-tuning.
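To make the metric concrete, here is a minimal sketch of the basic SDR definition, 10·log10(‖s‖² / ‖s − ŝ‖²), in plain Python. It is a simplification for illustration only and not the evaluation protocol behind the MUSDB figures above, which relies on dedicated tooling and per-track aggregation. The function name `sdr_db` is made up for this example.

```python
import math
from typing import Sequence


def sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Basic signal-to-distortion ratio in dB for two equal-length mono signals."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise_power == 0:
        return math.inf   # perfect reconstruction
    if signal_power == 0:
        return -math.inf  # silent reference: any error is pure distortion
    return 10 * math.log10(signal_power / noise_power)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]  # every sample is off by 10%
print(round(sdr_db(reference, estimate), 2))
```

A uniform 10% amplitude error corresponds to about 20 dB, which gives a feel for the scale: the gap between 7.7 dB and 9.2 dB is a meaningful reduction in residual error, not a rounding difference.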

This Mechanism Decides "the Shape of the Pitfalls"

Once the structure is clear, the troubles described later turn out to be inevitable.

  • A Transformer model consumes a lot of memory per segment, and htdemucs is limited to a maximum of 7.8 seconds per segment. Long audio is therefore split into segments automatically, and memory pressure is determined by segment, jobs, and shifts.
  • Carrying both the waveform and the spectrogram makes the computation heavy, so it is slow on CPU (a GPU is recommended, and 3GB of VRAM is enough to run).
  • The quality knob (shifts) averages multiple predictions, so quality and time trade off directly.

In other words, the parameters are not knobs to turn by feel; each one has a meaning that follows from this design.
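As a rough way to reason about these knobs, the sketch below estimates how many model forward passes a track needs. It assumes that chunking advances by (1 − overlap) × segment seconds per step and that each shift repeats every chunk; that is my reading of how the parameters interact, not a quote of the implementation, and the function name is hypothetical.

```python
import math


def estimate_forward_passes(duration_s: float, segment_s: float = 7.8,
                            overlap: float = 0.25, shifts: int = 1) -> int:
    """Rough count of forward passes for one track (single model, not a bag)."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride_s = segment_s * (1 - overlap)        # how far each chunk advances
    chunks = math.ceil(duration_s / stride_s)   # chunks needed to cover the track
    return chunks * max(shifts, 1)              # each shift repeats every chunk


for label, kwargs in {
    "4-min song, defaults": dict(duration_s=240),
    "4-min song, shifts=5": dict(duration_s=240, shifts=5),
    "1-hour recording": dict(duration_s=3600),
}.items():
    print(label, estimate_forward_passes(**kwargs))
```

Under these assumptions, time grows with the number of passes, while peak memory is driven by the size of a single chunk, which is why lowering segment helps with OOM and lowering shifts helps with time.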


How to Choose a Model: htdemucs, htdemucs_ft, htdemucs_6s, mdx

Demucs bundles multiple pretrained models, selected with -n (CLI) or model= (API). In practice the choice is simple.

| Model name | Stems | Speed | Quality | When to choose it |
| --- | --- | --- | --- | --- |
| htdemucs (default) | 4 (vocals/drums/bass/other) | Standard | High | Start here. The v4 standard, sufficient for many uses |
| htdemucs_ft | 4 | About 4× slower | Highest | Quality first. Bundles models fine-tuned for each source (official: "4 times more time but might be a bit better") |
| htdemucs_6s | 6 (+guitar/piano) | Standard | High (piano is weak) | When you also want to split guitar / piano. The official docs explicitly state that piano quality is "not good yet" |
| hdemucs_mmi | 4 | Fast-ish | Medium to high | A retrained version of v3 (Hybrid Demucs). When you want something lighter, without the Transformer |
| mdx / mdx_extra | 4 | Fast-ish | Medium to high | From the earlier MDX challenge. mdx_extra includes extra data |
| mdx_q / mdx_extra_q | 4 | Fast | Medium | Quantized versions. When you want to keep memory / size down |

Shortcuts for the decision:

  • You want the best general-purpose choice for now → htdemucs (default). No need to specify it.
  • You want the last few % of quality (deliverables, hero cuts) → htdemucs_ft, in exchange for 4× the time.
  • You also want to split guitar/piano → htdemucs_6s. But don't expect too much from the piano.
  • Memory/size is tight (edge, large batches) → the quantized mdx_q / mdx_extra_q.

💡 htdemucs_ft is about 4× slower because it internally runs multiple per-source fine-tuned models in sequence. This is why demucs.api.list_models() returns its result split into {"single": [...], "bag": [...]}: htdemucs_ft belongs to the bag (bundle of models) side. Since the quality gain is slight, the cost-efficient approach is a two-tier setup: drafts with htdemucs, and only deliverables with htdemucs_ft.
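The decision shortcuts can be written down as a small helper so that model selection lives in one place. This is a hypothetical sketch: the function name, the flag names, and the priority order when several flags are set are my own assumptions, and only the model names come from the table.

```python
def choose_model(*, want_guitar_piano: bool = False, final_deliverable: bool = False,
                 memory_tight: bool = False) -> str:
    """Map the decision shortcuts to a model name (hypothetical helper)."""
    if want_guitar_piano:
        return "htdemucs_6s"   # 6 stems; piano quality is weak
    if memory_tight:
        return "mdx_q"         # quantized, smaller footprint
    if final_deliverable:
        return "htdemucs_ft"   # about 4x slower, slightly better
    return "htdemucs"          # the default


print(choose_model())
print(choose_model(final_deliverable=True))
```

Keeping the rule in code makes the two-tier workflow explicit: draft jobs call it with no flags, and only adopted tracks are re-run with final_deliverable=True.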


Usage A: First, Try It (CLI One-Liner)

Demucs's biggest strength is how lightweight it is to install. Unlike a heavy diffusion model such as LatentSync, it works even without a GPU (just slowly). Start by hearing the quality on your own audio.

1. Install

# Requires Python 3.8+. Creating a virtual environment first is the safe route.
python3 -m pip install -U demucs

# To handle formats other than wav (mp3, etc.), install ffmpeg as well (especially required on Windows)
#   macOS:  brew install ffmpeg
#   Ubuntu: sudo apt install ffmpeg

2. Separate (One Command)

# Separate into 4 stems with the default model (htdemucs)
demucs song.mp3

# The output is written in this layout:
# separated/htdemucs/song/
#   ├── drums.wav
#   ├── bass.wav
#   ├── other.wav
#   └── vocals.wav

3. Frequently Used Patterns

# ① Karaoke / vocal removal: produce only 2 files, vocals and no_vocals
demucs --two-stems=vocals song.mp3
#   → separated/htdemucs/song/vocals.wav, no_vocals.wav

# ② Specify the GPU explicitly (the default is automatic; use -d cpu to force the CPU)
demucs -d cuda song.mp3

# ③ Highest-quality model + mp3 output (320 kbps)
demucs -n htdemucs_ft --mp3 --mp3-bitrate 320 song.mp3

# ④ Specify the output directory + process multiple files in one run
demucs -o ./out track1.wav track2.flac track3.mp3

# ⑤ 6 stems (including guitar / piano)
demucs -n htdemucs_6s song.mp3

demucs --help shows all flags. The main ones are as follows.

| Flag | Role | Default |
| --- | --- | --- |
| -n MODEL | Model selection | htdemucs |
| --two-stems=vocals | Narrow to 2 stems (the specified source / everything else) | off (4 stems) |
| -d {cpu,cuda} | Compute device | auto |
| --mp3 / --flac | Output format (default is 16bit wav) | wav |
| --mp3-bitrate | mp3 bitrate (kbps) | 320 |
| --mp3-preset | mp3 quality (2 = best to 7 = fastest) | |
| --int24 / --float32 | Save wav as 24bit int / 32bit float | 16bit |
| --shifts N | Average N times with random time shifts (quality↑, N× slower) | 1 |
| --overlap | Segment overlap ratio | 0.25 |
| --segment SEC | Segment length (htdemucs is max 7.8) | model default |
| -j JOBS | Number of parallel jobs (faster but memory↑) | 1 |
| --clip-mode {rescale,clamp} | Clipping handling | rescale |
| -o DIR | Output directory | separated |
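If you drive the CLI from a script, building the argument list in code avoids shell quoting mistakes with file names that contain spaces. The sketch below uses only flags listed in the table; the helper name `build_demucs_cmd` and its parameters are hypothetical.

```python
import shlex
from typing import List, Optional


def build_demucs_cmd(inputs: List[str], *, model: str = "htdemucs",
                     two_stems: Optional[str] = None, device: Optional[str] = None,
                     shifts: int = 1, out_dir: str = "separated",
                     mp3_bitrate: Optional[int] = None) -> List[str]:
    """Build an argv list for the demucs CLI (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one input file is required")
    cmd = ["demucs", "-n", model, "-o", out_dir, "--shifts", str(shifts)]
    if two_stems:
        cmd.append(f"--two-stems={two_stems}")
    if device:
        cmd += ["-d", device]
    if mp3_bitrate:
        cmd += ["--mp3", "--mp3-bitrate", str(mp3_bitrate)]
    return cmd + list(inputs)


cmd = build_demucs_cmd(["my song.mp3"], two_stems="vocals", device="cuda")
print(shlex.join(cmd))  # shell-safe rendering, useful for logs
```

To execute it, pass the list to subprocess.run(cmd, check=True). Passing a list rather than a string means the file name is never re-parsed by a shell.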

Usage B: Embed via the Python API

To embed Demucs in your own server or batch job, the Python API (demucs.api) is a better fit than shelling out to the CLI. Because you receive tensors directly, you can pass them straight to downstream stages (mixing, encoding, uploading).

Minimal Setup

This uses the API exactly as documented in the official docs/api.md.

import demucs.api

# Create the Separator once to load the model, then reuse it
separator = demucs.api.Separator(model="htdemucs")

# Separate: the return value is a tuple of (original waveform, {stem name: tensor})
origin, stems = separator.separate_audio_file("song.mp3")
# stems = {"drums": Tensor, "bass": Tensor, "other": Tensor, "vocals": Tensor}

# Write out each stem
for name, source in stems.items():
    demucs.api.save_audio(source, f"{name}.wav", samplerate=separator.samplerate)

The Separator constructor arguments (faithful to the official docs):

| Argument | Meaning |
| --- | --- |
| model="htdemucs" | Model name |
| repo=None | Local model storage location (for offline operation) |
| device=None | "cuda" / "cpu" / torch.device. None is auto |
| shifts=None | >0 for random-shift averaging (SDR up to +0.2, slower accordingly) |
| overlap=None | Segment overlap ratio (default equivalent to 0.25) |
| split=None | Whether to chunk long audio |
| segment=None | Segment length (seconds). htdemucs is max 7.8 |
| jobs=None | Number of parallel jobs |
| progress=False | Show a progress bar |
| callback=None | A function called at chunk start/end (used for progress / observation) |

Make Karaoke (vocals / no_vocals) via the API

To get the equivalent of the CLI's --two-stems through the API, sum the non-vocal stems.

import demucs.api

separator = demucs.api.Separator(model="htdemucs")
origin, stems = separator.separate_audio_file("song.mp3")

# Summing everything except the vocals gives the accompaniment (no_vocals)
no_vocals = sum(src for name, src in stems.items() if name != "vocals")

demucs.api.save_audio(stems["vocals"], "vocals.wav", samplerate=separator.samplerate)
demucs.api.save_audio(no_vocals, "no_vocals.wav", samplerate=separator.samplerate)

Observe Progress (callback)

For long audio, you will want to show progress to the user. The callback lets you pick up per-chunk progress.

def on_progress(data: dict) -> None:
    # data carries the current segment position, the total length, the model state, and so on
    # In production, send this to structured logs / metrics (do not print PII to stdout)
    if data.get("state") == "end":
        print(f"segment {data.get('segment_offset')} done")

separator = demucs.api.Separator(model="htdemucs", callback=on_progress, progress=True)
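To turn callback payloads into a percentage, a small pure function is enough. This sketch assumes the payload carries `segment_offset` and `audio_length` in the same unit, which is how I read docs/api.md; verify the key names against the version you install. The sample payload values are invented for illustration.

```python
def progress_percent(data: dict) -> float:
    """Approximate progress within the current pass, from one callback payload."""
    offset = data.get("segment_offset", 0)
    total = data.get("audio_length") or 0
    if total <= 0:
        return 0.0
    return round(min(offset / total, 1.0) * 100, 1)


# Payload shaped like the callback argument (values invented for illustration)
sample = {"state": "end", "segment_offset": 441_000, "audio_length": 1_764_000}
print(progress_percent(sample))
```

Because the offset marks where a chunk starts, the figure lags by one chunk, and with shifts greater than 1 or a bag of models the pass repeats, so treat it as a per-pass indicator rather than overall completion.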

Practical Parameter Tuning: Presets by Use

The official docs list the knobs and their defaults. Here I add values that work in practice; the table below is a set of ready-to-use presets.

| Use | Model | shifts | overlap | segment | Output | Aim |
| --- | --- | --- | --- | --- | --- | --- |
| Draft check (internal review) | htdemucs | 1 | 0.25 | default | mp3 320 | See the shape fast and cheap |
| Standard delivery (distribution/PR) | htdemucs | 2 | 0.25 | default | wav | The best balance of quality and speed |
| Highest quality (hero / mastering) | htdemucs_ft | 5 | 0.5 | default | float32 | Details first (several times slower) |
| Memory-saving (small VRAM/CPU) | mdx_q | 1 | 0.1 | smaller | mp3 | Avoid OOM, prioritize finishing |
| ASR preprocessing (to Whisper) | htdemucs | 1 | 0.25 | default | wav (vocals only) | Speed first, only the voice is needed |

In practical terms, the knobs mean the following.

  • shifts (time-shift averaging): shifts the input slightly, predicts N times, and averages the results. Quality rises (official: up to +0.2 dB SDR), but the time is N×. As a rule, use 2–5 for deliverables only and 1 for drafts.
  • overlap (segment overlap): default 0.25. Raising it reduces artifacts at the seams but makes processing slower. Set it to 0.5 only for material where the seams are bothersome.
  • segment (the length of one chunk): the main knob for memory and quality. htdemucs allows a maximum of 7.8 seconds. Lower it if OOM occurs. Lowering it too far increases the number of seams, so start with the default and shrink it gradually only when it fails.
  • -j / jobs (parallelism): processing in parallel is faster, but memory grows in proportion to the parallelism. Size it against your VRAM/RAM.
  • --mp3-preset: 2 (highest quality, slow) to 7 (fastest, low quality). Use 2–3 for delivery and the faster side for checking.

A principle that ties directly to cost: quality costs time, and therefore money, roughly in proportion to shifts. Running every song at high shifts is wasteful. The two-tier approach of checking drafts at shifts=1 and then running only the adopted ones with htdemucs_ft + shifts=2–5 gives the best production unit cost.
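The presets and the cost principle can be captured in a few lines so that jobs refer to a named preset instead of repeating raw numbers. The values below are copied from the preset table; the dictionary keys and the cost model are assumptions of this sketch, and the cost model deliberately ignores overlap and the speed differences between model families.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Preset:
    model: str
    shifts: int
    overlap: float


# Values copied from the preset table; the keys are made-up labels.
PRESETS: Dict[str, Preset] = {
    "draft": Preset("htdemucs", 1, 0.25),
    "standard": Preset("htdemucs", 2, 0.25),
    "highest": Preset("htdemucs_ft", 5, 0.5),
    "low_memory": Preset("mdx_q", 1, 0.1),
    "asr": Preset("htdemucs", 1, 0.25),
}


def relative_cost(preset: Preset) -> float:
    """Time relative to the draft preset: shifts is linear, htdemucs_ft is about 4x."""
    model_factor = 4.0 if preset.model == "htdemucs_ft" else 1.0
    return preset.shifts * model_factor


for name, preset in PRESETS.items():
    print(f"{name}: about {relative_cost(preset):g}x the draft cost")
```

Seen this way, the highest-quality preset costs roughly twenty drafts, which is the arithmetic behind reserving it for adopted tracks only.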


Use-Case Recipes (Code You Can Use As-Is)

A. ASR (Whisper) Preprocessing: Raise Transcription Accuracy Under BGM

Audio that contains music or noise is prone to misrecognition when passed to ASR as-is. Extracting only the human voice with Demucs before passing it to Whisper raises accuracy.

# 1) Extract only the voice (drop accompaniment and noise)
demucs --two-stems=vocals -o sep podcast_with_bgm.wav

# 2) Pass the clean voice to Whisper (self-hosted or API; see the separate article)
whisper sep/htdemucs/podcast_with_bgm/vocals.wav --language ja

The detailed production design on the Whisper side (self-host vs API, cost, timestamps) is in the Whisper article. "Extract the voice with Demucs → transcribe with Whisper" is the standard preprocessing that raises the quality of video localization and meeting minutes.
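When the two steps are chained in a script, the only fragile part is the path of the separated file. The sketch below derives it from the layout shown earlier, `<out_dir>/<model>/<input name without extension>/<stem>.<ext>`; the helper name `stem_path` is hypothetical.

```python
from pathlib import Path


def stem_path(out_dir: str, model: str, input_file: str, stem: str, ext: str = "wav") -> Path:
    """Path where the CLI writes a stem, following the default output layout."""
    return Path(out_dir) / model / Path(input_file).stem / f"{stem}.{ext}"


vocals = stem_path("sep", "htdemucs", "podcast_with_bgm.wav", "vocals")
print(vocals.as_posix())
```

Deriving the path in one place keeps the Whisper step from silently reading a stale file when the model name or output directory changes.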

B. Video Localization: Keep the BGM and Dub Into Another Language

When you replace the audio itself rather than adding subtitles, the result sounds cheap if the BGM and sound effects disappear along with the voice. The key is to peel off only the narration with Demucs and keep the BGM (no_vocals).

# 1) Extract the audio from the video (44.1kHz)
ffmpeg -i source.mp4 -vn -ar 44100 audio.wav

# 2) Separate into narration (vocals) and BGM / sound effects (no_vocals)
demucs --two-stems=vocals -o sep audio.wav

# 3) sep/htdemucs/audio/no_vocals.wav is the track with only BGM and sound effects.
#    Mix the new-language narration into it to dub while preserving the original atmosphere.
ffmpeg -i sep/htdemucs/audio/no_vocals.wav -i narration_en.wav \
  -filter_complex amix=inputs=2:duration=longest dubbed_audio.wav

This is exactly the first stage of my AI video-localization platform. How cleanly the BGM is preserved here decides how authentic the final dubbed video feels.

C. Karaoke / Off-Vocal for Sing-Along Covers

demucs --two-stems=vocals --mp3 --mp3-bitrate 320 song.mp3
# → no_vocals.mp3 is the karaoke track as-is

D. Ear-Copying / Transcription: Extract Only Bass/Drums

demucs song.mp3
# Load only separated/htdemucs/song/bass.wav into a DAW to follow the bass line easily

5 Pitfalls That Always Bite in Production, and Resilience Design

These problems do not occur in a demo (one 4-minute wav file) but erupt all at once in reality (tens of minutes of audio, large volumes, diverse formats). As the mechanism chapter showed, they follow directly from Demucs's design.

① Memory Exhaustion (OOM) at Long Audio / Parallelism

The Transformer consumes a lot of memory per segment. Combine long audio, higher shifts, and -j parallelism, and the job fails with out of memory, whether on GPU memory or CPU RAM.

Countermeasure: because Demucs segments long audio automatically, the effective knobs are lowering segment, reducing jobs, and dialing back shifts. In low-VRAM environments, the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1 also helps. Beyond that, treat OOM as a normal case rather than an exceptional one, and degrade gracefully in stages.

import torch
import demucs.api

def separate_resilient(path: str, model: str = "htdemucs"):
    """Treat OOM as a normal case and degrade in order: shrink segment, then fall back to CPU."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    segment: float | None = None  # try the model default first

    for attempt in range(4):
        try:
            sep = demucs.api.Separator(model=model, device=device, segment=segment)
            return sep.separate_audio_file(path)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            if segment is None:
                segment = 7.0          # start below the htdemucs upper limit (7.8)
            elif segment > 4:
                segment = segment / 2   # shrink segment first
            elif device == "cuda":
                device, segment = "cpu", None  # last resort: fall back to CPU (slow, but it finishes)
            else:
                raise                   # failing even on CPU is a genuine fault; do not swallow it
    raise RuntimeError("separation failed after retries")

② The CPU Fallback's "Works, but Painfully Slow"

If there is no GPU, or a driver mismatch drops you to the CPU, processing takes about 1.5× the length of the source: about 6 minutes for a 4-minute song, about 1.5 hours for 1 hour of audio. Jobs tend to end up in a state where they are running but never seem to finish.

Countermeasure: always attach a timeout and progress logs to CPU execution. Route batches to the GPU and treat the CPU as the last line of fallback. Check torch.cuda.is_available() at startup and surface unintended CPU execution as a warning.

import torch, logging

logger = logging.getLogger("demucs")
if not torch.cuda.is_available():
    logger.warning("CUDA unavailable: running on CPU (~1.5x realtime). Batch jobs will be slow.")
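A timeout for CPU jobs can be derived from the audio length rather than picked by hand. The sketch below uses the roughly 1.5× figure quoted above; the safety margin, the minimum floor, and the function name are assumptions you should tune to your own hardware.

```python
def cpu_timeout_seconds(duration_s: float, realtime_factor: float = 1.5,
                        safety_margin: float = 2.0, floor_s: float = 60.0) -> float:
    """Timeout for a CPU job: expected time (about 1.5x the audio length) times a margin."""
    expected = duration_s * realtime_factor
    return max(expected * safety_margin, floor_s)


print(cpu_timeout_seconds(240))    # 4-minute song
print(cpu_timeout_seconds(3600))   # 1-hour recording
```

A job that exceeds this budget is more likely stuck than slow, so it can be killed and requeued instead of occupying a worker indefinitely.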

③ ffmpeg Dependence / Format Incompatibility

torchaudio can read wav, mp3, flac, ogg, and similar formats. On Windows and some other environments, ffmpeg is required to read and write mp3, and without it you get the hard-to-diagnose failure where wav works but mp3 does not.

Countermeasure: adding one stage that normalizes the input to 44.1kHz wav with ffmpeg before feeding it to Demucs greatly improves reproducibility and compatibility.

# Normalization before ingestion: convert any input to 44.1kHz stereo wav
ffmpeg -y -i input.any -ar 44100 -ac 2 normalized.wav
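In a pipeline, the same normalization is usually wrapped so that a missing ffmpeg binary fails early with a clear message instead of surfacing later as a decoding error. A minimal sketch using only the standard library; the function names are hypothetical and the ffmpeg arguments are the ones from the command above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def normalize_cmd(src: Path, dst: Path) -> List[str]:
    """argv for the normalization step: any input to 44.1 kHz stereo wav."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "44100", "-ac", "2", str(dst)]


def normalize(src: Path, dst: Path) -> Path:
    """Run ffmpeg, failing early with a clear message if it is not installed."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH; install it before processing non-wav input")
    subprocess.run(normalize_cmd(src, dst), check=True, capture_output=True)
    return dst


print(normalize_cmd(Path("input.mp3"), Path("normalized.wav")))
```

With check=True, a corrupt or unsupported input raises subprocess.CalledProcessError at this stage, so the failure is attributed to the file rather than to the separation step.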

④ Double-Processing (Separating the Same Audio Over and Over, Melting GPU Money)

With user resends, retries, and rapid clicks, the same audio gets separated over and over, which wastes GPU time directly. Separation is heavy processing, so a mistake here hits cost immediately.

Countermeasure: build a deterministic job key from the input's contents (the file's sha256) + the model + the parameters, and cache the result. For the same input, return the existing result without recomputing (idempotency).

# Requires Python 3.10+ (dataclass slots=True and the "str | None" annotation syntax)
import hashlib, json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass(frozen=True, slots=True)
class SepParams:
    model: str = "htdemucs"
    two_stems: str | None = None
    shifts: int = 1
    overlap: float = 0.25

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time (constant memory even for large files)
            h.update(chunk)
    return h.hexdigest()

def job_key(audio: Path, params: SepParams) -> str:
    payload = json.dumps({"audio": file_sha256(audio), **asdict(params)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def separate_idempotent(audio: Path, out_root: Path, params: SepParams) -> Path:
    dest = out_root / job_key(audio, params)
    if (dest / "vocals.wav").exists():   # idempotency: same input × same settings is not recomputed
        return dest                       # rapid clicks and retries do not consume the GPU again
    dest.mkdir(parents=True, exist_ok=True)
    # ... call separate_resilient here and save the results under dest ...
    return dest
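One gap in an existence check like the one above is a job that crashes halfway or two workers that start at the same moment: a half-written directory can then look like a finished result. A common remedy, sketched here as an assumption rather than as part of the original design, is to write into a temporary directory and rename it into place only when everything is saved. The helper name and the fake separation function are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def publish_atomically(dest: Path, produce: Callable[[Path], None]) -> Path:
    """Write results into a temp directory, then rename it into place in one step."""
    if dest.exists():
        return dest                                   # already published by an earlier job
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = Path(tempfile.mkdtemp(prefix=dest.name + ".", dir=dest.parent))
    produce(tmp)                                      # e.g. run separation and save stems here
    try:
        os.rename(tmp, dest)                          # atomic on the same filesystem
    except OSError:
        # Another worker published first; discard our copy and use theirs.
        for f in tmp.iterdir():
            f.unlink()
        tmp.rmdir()
    return dest


root = Path(tempfile.mkdtemp())
calls = []

def fake_separation(out: Path) -> None:               # stand-in for the real separation
    calls.append(out)
    (out / "vocals.wav").write_bytes(b"RIFF")

publish_atomically(root / "job123", fake_separation)
publish_atomically(root / "job123", fake_separation)  # second call is a cache hit
print(len(calls), (root / "job123" / "vocals.wav").exists())
```

With this pattern, the presence of the destination directory itself means the job finished, so the cache check no longer depends on which stem files a given setting produces.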

⑤ Checking Quality Only by "Ear"

In large batches, vocal residue and bleed between stems will inevitably slip through if quality is checked only by ear.

Countermeasure: after separation, mechanically check measures such as the stems' RMS and correlation, and send only the results that fall outside a threshold to human review. For example, simply monitoring with a simple metric whether no_vocals retains too much of the human voice cuts review effort drastically. The metric does not need to be perfect; the aim is to catch outliers by machine.
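A minimal sketch of such a check in plain Python follows. The thresholds (`max_corr`, `min_rms`) and the synthetic test signals are invented for illustration and would need calibration on real material; in a real pipeline you would compute the same quantities on the stem tensors.

```python
import math
from typing import Sequence


def rms(x: Sequence[float]) -> float:
    """Root-mean-square level of a mono signal."""
    return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized correlation in [-1, 1]; 0.0 if either signal is silent."""
    denom = math.sqrt(sum(v * v for v in a) * sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def needs_review(vocals: Sequence[float], no_vocals: Sequence[float],
                 max_corr: float = 0.3, min_rms: float = 1e-3) -> bool:
    """Flag a result when a stem is near-silent or the two stems look alike."""
    if rms(vocals) < min_rms or rms(no_vocals) < min_rms:
        return True
    return abs(correlation(vocals, no_vocals)) > max_corr


voice = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
music = [math.sin(2 * math.pi * 50 * t / 1000) for t in range(1000)]
leaky = [m + 0.8 * v for m, v in zip(music, voice)]   # accompaniment with vocal bleed

print(needs_review(voice, music))   # clean split
print(needs_review(voice, leaky))   # bleed present
```

The clean pair passes and the pair with bleed is flagged, which is all the check is meant to do: route suspicious results to a person without claiming to measure quality precisely.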


Production-Operation Design Principles (Observability, Idempotency, Resilience, Cost)

Here are the pitfall countermeasures restated in the language of operational quality. This is the difference between "works" and "does not fall over in production."

  • Idempotency: use sha256(audio contents + model + params) as the job key and cache the result. Do not separate twice on resends, rapid clicks, or retries. The heavier the processing, the more this pays off.
  • Resilience: treat OOM, CPU fallback, and format mismatch as normal cases rather than exceptional ones. With "shrink segment → fall back to CPU" and "normalize with ffmpeg before ingestion," one file's failure does not drag down the whole batch.
  • Observability: record in structured logs, per song, which model was used, the shifts/overlap values, how many seconds it took, and whether the device was GPU or CPU. This lets you trace later why only one song is slow or poor in quality. Do not write the audio content itself (PII) to logs.
  • Cost efficiency: cut the unit cost in 3 tiers: ① zero re-separation via the idempotency cache, ② drafts with htdemucs + shifts=1 and only deliverables with htdemucs_ft, ③ fill GPU batches before running them. Because Demucs is lightweight, a favorable self-hosting unit cost is easy to reach.
  • API vs self-hosting: because Demucs is MIT-licensed and runs from 3GB VRAM, self-hosting is the baseline. Hosted options such as Replicate (e.g. cjwbw/demucs) are handy for trying a few files without building an environment, but a steady large volume is far cheaper on your own GPU batch. The standard path is to verify quality with the local CLI first and turn it into a GPU worker once volume grows (prices and specs change, so check the primary source when you use them).
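The observability point can be as simple as one JSON line per job. The field names in this sketch are my own choice rather than an established schema; the important properties are that every field is a setting or a timing and that nothing derived from the audio content is logged.

```python
import json
import time


def separation_log(job_key: str, model: str, shifts: int, overlap: float,
                   device: str, audio_seconds: float, elapsed_seconds: float) -> str:
    """One JSON line per job; carries settings and timing, never audio content."""
    record = {
        "event": "separation_done",
        "job_key": job_key,
        "model": model,
        "shifts": shifts,
        "overlap": overlap,
        "device": device,
        "audio_seconds": audio_seconds,
        "elapsed_seconds": round(elapsed_seconds, 3),
        "realtime_factor": round(elapsed_seconds / audio_seconds, 3) if audio_seconds else None,
    }
    return json.dumps(record, sort_keys=True)


start = time.monotonic()
# ... separation would run here ...
line = separation_log("ab12", "htdemucs", 1, 0.25, "cuda", 240.0, time.monotonic() - start)
print(line)
```

Logging the realtime factor alongside the device makes an unintended CPU fallback obvious in a dashboard: the factor jumps toward 1.5 while the device field reads cpu.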

Comparison with Other Source-Separation Tools

A summary to answer "so which one should I use?" The right answer is to choose by use; no tool is all-purpose.

| Tool | Method | Quality (rough) | Speed / lightness | License | Suited scene |
| --- | --- | --- | --- | --- | --- |
| Demucs v4 (HT Demucs) | Hybrid (waveform + spectral) + Transformer | SOTA-class among public models (9.0–9.20 dB) | Fast on GPU, CPU-OK, from 3GB VRAM | MIT | Quality-first, production separation in general |
| Spleeter (Deezer) | CNN (spectrogram) | A notch below | Very fast | MIT | Speed-first, large draft volumes |
| Open-Unmix (UMX) | BiLSTM | Modest (reference implementation) | Lightweight | MIT | Research, baseline comparison |
| MDX-Net family (bundled in UVR, etc.) | Spectrogram | High (strong in ensembles) | Depends on the model | Various | Combine with Demucs for an ensemble |

In practice, the decisions come down to roughly this.

  • Quality is paramount, and you want commercial use cleared under MIT → Demucs v4. First choice.
  • You just want to pre-process a large volume fast → a two-tier flow: drafts with Spleeter, then real processing of the adopted ones with Demucs.
  • You want to squeeze out the last few % → ensemble Demucs (htdemucs_ft) with the MDX family (GUIs such as UVR implement this).

Each tool's license and quality change over time. Always confirm commercial-use terms with the primary source. Demucs is MIT (official repository) at the time of writing.


Frequently Asked Questions (FAQ)

Q. Can I use it commercially? A. Demucs's code and models are under the MIT license, which permits commercial use. However, the copyright and master rights of the separated audio itself are a separate matter. Acts such as separating a commercial song and redistributing it require the rights holder's permission. Rights clearance is the user's responsibility.

Q. Is a GPU mandatory? A. No. It runs on CPU too (slowly, taking about 1.5× the length of the audio). On a GPU it works from 3GB VRAM, and 7GB is a rough guide for comfortable operation at default settings. A GPU is recommended for large batch processing.

Q. Can it run in real time? A. It is basically offline processing. The design separates a whole file at a time, and on CPU it takes about 1.5× the length of the audio. It is unsuited to strict low-latency live use.

Q. Which model should I choose? A. Start with the default htdemucs; it is sufficient for most cases. Use htdemucs_ft (4× slower) only for cuts that need the highest quality, and htdemucs_6s if you also want to split guitar/piano (don't expect too much from the piano).

Q. I run out of memory on songs over 5 minutes. A. Demucs segments long audio automatically. On OOM, lower segment, reduce jobs, and dial back shifts; for low VRAM, also set the environment variable PYTORCH_NO_CUDA_MEMORY_CACHING=1. If that is still not enough, fall back to CPU (see the resilience code).

Q. How do I output to mp3? A. Use --mp3 --mp3-bitrate 320. The default is 16bit wav. For higher bit depth, use --int24 / --float32; for flac, use --flac.

Q. Can I use it for narration or meeting audio, not just songs? A. Yes. The vocals stem captures the human voice, so even for non-musical material it is effective at separating narration or speech from BGM and noise. It is genuinely useful in video localization and ASR preprocessing.

Q. Can I use it for Whisper preprocessing? A. Yes, it is effective. Use --two-stems=vocals to produce a voice-only track and pass that to ASR, and transcription accuracy under BGM/noise rises. Read this alongside the Whisper article.

Q. Which is cheaper, self-hosting or an API? A. Demucs is lightweight and MIT-licensed, so self-hosting is the baseline. For trying a few files, hosting (Replicate, etc.) is handy, but for a steady large volume, self-operated GPU batches tend to win on unit cost. The standard path is to check quality locally first and turn it into a worker once volume grows.


Summary: Moving Demucs from "Works" to "Earns in Production"

The essence of Demucs v4 (HT Demucs) lies in holding both the waveform, heard in time, and the spectrogram, heard in frequency, and cross-checking their long-range context with a Transformer. That is exactly why it produces SOTA-class quality among public models, and exactly why you need to understand and handle the design-rooted knobs segment, shifts, and overlap.

The path to implementation is simple.

  1. First check the quality on your own audio with the CLI (pip install -U demucs → demucs song.mp3). No GPU is needed; it runs on CPU too.
  2. If the quality convinces you, embed it in your own pipeline with the Python API (demucs.api.Separator) and refine it with the per-use presets.
  3. For production, build idempotency, resilience, observability, and cost into the design: a sha256 idempotency key, staged degradation on OOM, ffmpeg normalization, and structured logging.

Only after going this far does it become a product that does not fall over on the customer's material, rather than a demo. And the point I most want to convey is that this chain of design judgments is where the choice of contractor makes a difference. Anyone can make a demo that just calls demucs, but for a foundation that does not break under the reality of long audio, diverse formats, and large batches, the number of mines you have already stepped on translates directly into quality.

I implemented this article's pitfalls and resilience design in the first stage (audio separation) of an AI video-localization platform that is running in production. If you are considering building or improving an audio/video AI pipeline that includes source separation, see my portfolio and feel free to get in touch. Working solo with generative AI, I carry projects from PoC through production operation quickly, affordably, and safely.


Sources / Official Resources

  • Versions, parameters, pricing, and licenses get updated. Always confirm the primary source before implementing. This article's numbers (SDR 9.0 / 9.20 dB, segment max 7.8 seconds, 3GB/7GB VRAM, shifts +up to 0.2dB, CPU about 1.5× real time, etc.) are based on the official information at the time of writing.
友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
