# Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs

> A guide to Meta's source-separation model Demucs v4 (HT Demucs) that stays faithful to the official documentation (GitHub, paper). It covers the waveform × spectrogram × Transformer mechanism, how to choose among the htdemucs-family models, CLI and Python API usage, working recipes for vocal separation, karaoke, ASR preprocessing, and video localization, and a production design (long-audio OOM recovery, idempotency, resilience) shown with concrete code.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Demucs, source separation, audio processing, Python, GPU, generative AI, MLOps
- URL: https://tomodahinata.com/en/blog/demucs-v4-music-source-separation-production-guide
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- Demucs v4 = HT Demucs. A hybrid bi-U-Net that runs a waveform encoder and a spectrogram encoder in parallel and bridges their innermost layers with a cross-domain Transformer. It scores 9.0 dB SDR on MUSDB HQ, and 9.20 dB with the fine-tuned htdemucs_ft, which is SOTA-class among public models
- Separates into 4 stems (drums / bass / other / vocals, 44.1 kHz stereo wav). htdemucs_6s adds guitar and piano. `--two-stems=vocals` gives a 'vocals / no_vocals' karaoke split in one command
- Installation is a single line, `pip install -U demucs`. It runs even on CPU (processing takes about 1.5× the track length), and on GPU it works from 3 GB of VRAM. MIT-licensed, so commercial use is permitted. v4 is mature and unlikely to see breaking changes, which suits production
- Uses: vocal removal / karaoke, ASR preprocessing (improving Whisper accuracy), keeping BGM in video localization, transcribing by ear / music education, remixing. Control quality and cost with shifts / overlap / segment / mp3 preset
- In production, the failure points are long-audio memory exhaustion, the ffmpeg dependency, a painfully slow CPU fallback, and double-processing. Handle them with staged OOM recovery (shrink segment, then fall back to CPU), a sha256 idempotency key, and structured logging

---

## The Goal of This Article

Demucs is a **music source separation (MSS) model**, published by Meta's AI research group (formerly Facebook AI Research), that **decomposes a single piece of audio into "vocals / drums / bass / other."** Its 4th generation, **Demucs v4 = HT Demucs (Hybrid Transformer Demucs)**, was proposed in the paper [_Hybrid Transformers for Music Source Separation_ (ICASSP 2023)](https://arxiv.org/abs/2211.08553) and achieves **SOTA-class separation quality** among public models.

This article is **strictly based on the official documentation ([GitHub](https://github.com/adefossez/demucs) / [docs/api.md](https://github.com/adefossez/demucs/blob/main/docs/api.md) / [paper](https://arxiv.org/abs/2211.08553))**, and it uses working code to fill in what the official README leaves out: **which situations Demucs fits, how to use it, and where it breaks**. By the time you finish reading, you should be able to do the following 3 things.

1. Explain to others **what kind of model Demucs v4 is, and why its quality is high**.
2. Choose between the **CLI (to try)** and the **Python API (to embed)** as appropriate, and **start experimenting today**.
3. Assemble a **resilient implementation** that survives **production** rather than a demo: long-audio memory exhaustion, the ffmpeg dependency, and double-processing.

> **About the author (disclosure for credibility)**: I have single-handedly designed, implemented, and **run in production** an **AI video-localization platform** that takes an uploaded video and fully automates "**audio separation → transcription → translation → multilingual dubbing → lip-sync**." Its **first stage (audio separation)** separates the "human voice" from the "BGM / sound effects" in the source video so that a **different-language narration can be swapped in while keeping the BGM**, and that stage is handled by the Demucs described in this article. The "pitfalls" and "resilience design" here are not demo knowledge but a record of the problems I hit in that real operation. The whole pipeline's design is in a [separate article](/blog/production-ai-video-localization-lipsync-gpu-pipeline), and the project overview is compiled at the [portfolio link](/case-studies/ai-video-localization-lipsync) at the end of this article.

---

## A 30-Second Summary (Conclusion First)

| Viewpoint | Conclusion |
| --- | --- |
| **What kind of model** | A source-separation model that decomposes a single piece of audio into 4 stems: "**vocals / drums / bass / other**" |
| **What's impressive** | Processes **both waveform and spectrogram** in parallel and bridges them with a Transformer. SOTA-class among public models with **9.0 dB SDR** on MUSDB HQ (**9.20 dB** with the fine-tuned `htdemucs_ft`) |
| **Generation** | **v4 = HT Demucs**. The paper is ICASSP 2023 (Rouard, Massa, Défossez). The current **latest stable version** |
| **Ease of installation** | A **single line**, `pip install -U demucs`. **Runs even on CPU** (takes about 1.5× the track length). GPU from **3 GB VRAM** |
| **License** | **MIT**. Commercial use OK (but the master rights / copyright of the separated audio are a separate matter) |
| **Just to try** | **One CLI command**: `demucs song.mp3`. Output is `separated/htdemucs/song/{vocals,drums,bass,other}.wav` |
| **To build in** | **Python API**: get tensors back from `demucs.api.Separator` and feed them into your own pipeline |
| **The main quality knobs** | `shifts` (averages several predictions for up to +0.2 dB SDR, N× slower) / `overlap` (default 0.25) / `segment` (memory and quality) |
| **Suited uses** | Karaoke / vocal removal, **ASR preprocessing (improving Whisper accuracy)**, keeping BGM in video localization, transcribing by ear / education, remixing |
| **Unsuited uses** | Strict zero-latency live streaming (basically offline processing) |

If you want to "**first check the quality on your own audio**," jump straight to [Usage A: First, Try It (CLI One-Liner)](#usage-a-first-try-it-cli-one-liner) right after this. Including `pip install`, you'll get a result in a few minutes.

---

## What Kind of Model Demucs v4 Is

The input is **a single audio file** (a track, narration, recorded stream, etc.). The output is **4 stems** (**vocals, drums, bass, and other**), each written as a **44.1 kHz stereo wav**. Choose `htdemucs_6s` and **guitar and piano** are added for a total of 6 stems.

It is useful in situations such as the following.

- **Karaoke / vocal removal**: `--two-stems=vocals` splits the audio into 2 files, "**vocals**" and "**no_vocals**" (accompaniment). Off-vocal tracks for cover singing and karaoke practice can be made in one command.
- **ASR (transcription) preprocessing**: for audio carrying BGM or noise, **extracting only the human voice before passing it to Whisper** raises transcription accuracy (see the [recipe](#use-case-recipes-code-you-can-use-as-is) later). This is a standard technique to combine with the [Whisper article](/blog/openai-whisper-production-guide-selfhost-vs-api).
- **Video localization (dubbing)**: split the source video's audio into "narration" and "BGM / sound effects," and **swap in a different-language narration while keeping the BGM**. The result is far more immersive than subtitles or a flat audio replacement, which helps conversion rates when expanding overseas. The first stage of my platform does exactly this.
- **Transcribing by ear / music education**: extract only the bass or only the drums to use for **learning parts by ear, transcription, and rhythm practice**.
- **Remix / sampling / DJ**: extract stems and recombine them, or pull out a cappella and instrumental material.

Conversely, it's **unsuited to strict real-time streaming** (uses needing a response within tens of ms). Demucs is basically **offline processing that handles one whole file at a time**, and on CPU it takes **roughly 1.5× the duration of the source** (GPU is far faster). For live use, consider a separate lightweight model.

---

## The Mechanism: Why "Waveform × Spectrogram × Transformer" (Made Gentle, Faithful to the Paper)

This chapter breaks down the core of the official paper **while keeping it accurate**. If you only want the implementation, jump to [how to choose a model](#how-to-choose-a-model-htdemucs-htdemucs_ft-htdemucs_6s-mdx). That said, understanding this part explains why the pitfalls described later take the shape they do.

### Source Separation Has "Two Ways of Seeing"

When you hand sound to a machine, there are broadly 2 ways of representing it.

- **Waveform (temporal)**: the time axis itself. Strong on **temporal details** like the sharpness of an attack and phase.
- **Spectrogram (spectral / STFT)**: an image of frequency × time. It shows **what is sounding in which frequency band**, which is closer to how a human tells instruments apart.

Past source-separation models leaned to one or the other. **Demucs v3 (Hybrid Demucs) unified both into one net**, and v4 layered a **Transformer** on top of that.

### The Structure of v4: Hybrid bi-U-Net + cross-domain Transformer

Breaking down the definition in the paper ([arXiv:2211.08553](https://arxiv.org/abs/2211.08553)) as-is, HT Demucs has this shape.

- It runs **2 parallel encoders**. One processes the **waveform**, the other the **spectrogram** (a temporal/spectral **bi-U-Net**).
- It **replaces the innermost layers with a cross-domain Transformer Encoder**.
- The Transformer integrates information with **self-attention within the same domain** (waveform with waveform) and **cross-attention across domains** (waveform ↔ spectrogram).

Intuitively, it has both "**an ear that listens in time**" and "**an ear that listens in frequency**" at once, and uses the Transformer's attention mechanism to **cross-check the long-range context of both** to judge which sound belongs to which stem. The question the paper posed was "**is long-range context useful for source separation, or are local acoustic features enough?**" and the answer was "**useful**."
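
To make the cross-domain idea concrete, here is a toy, pure-Python sketch of scaled dot-product cross-attention, in which tokens from one branch (the queries) read from the other branch (the keys and values). It only illustrates the attention pattern: the function name, the tiny vectors, and the absence of learned projections, multiple heads, and positional embeddings are all simplifying assumptions, not the HT Demucs implementation.

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes `values` by its similarity to `keys`."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Softmax, shifted by the max score for numerical stability.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy tokens: 2 "waveform" tokens attend over 3 "spectrogram" tokens.
wave_tokens = [[1.0, 0.0], [0.0, 1.0]]
spec_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(wave_tokens, spec_tokens, spec_tokens)
```

Self-attention is the same computation with queries, keys, and values all taken from one branch; swapping the roles of the two branches gives the attention in the opposite direction.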

### Quality: 9.0–9.20 dB SDR on MUSDB HQ

Separation quality is measured by **SDR (Signal-to-Distortion Ratio; higher is better)**. The official reports are as follows.

| Model | SDR (MUSDB HQ) | Notes |
| --- | --- | --- |
| Hybrid Demucs (v3) | about **7.7 dB** | No extra data |
| **HT Demucs (v4, htdemucs)** | **9.0 dB** | Trained on MusDB + 800 songs |
| **HT Demucs f.t. (v4, htdemucs_ft)** | **9.20 dB** | sparse attention + per-source fine-tuning. **SOTA-class among public models** |

The paper reports that, when trained with 800 additional songs, HT Demucs surpassed Hybrid Demucs by **0.45 dB**, and further reached **9.20 dB** by adding **sparse attention (expanding the receptive field with a sparse attention mechanism)** and **per-source fine-tuning**.
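
For intuition about what these numbers measure, the simplest form of SDR compares the energy of the reference stem with the energy of the estimation error, expressed in decibels. The sketch below implements that basic formula on plain Python lists. The function name and the small epsilon are my own choices, and benchmark tooling applies its own averaging over tracks and time windows, so treat this as an illustration rather than the official evaluation.

```python
import math

def sdr_db(reference, estimate, eps=1e-10):
    """Basic SDR in dB: 10 * log10(signal energy / error energy)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))

reference = [0.5, -0.5, 0.25, -0.25]   # the true stem (toy samples)
estimate = [0.45, -0.5, 0.25, -0.2]    # a slightly imperfect separation
score = sdr_db(reference, estimate)    # roughly 21 dB for this toy signal
```

Because the scale is logarithmic, a gain of 1 dB corresponds to cutting the error energy by roughly a fifth, which is why differences of a few tenths of a dB between models are treated as meaningful.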

### This Mechanism Decides "the Shape of the Pitfalls"

Pin down the structure, and the troubles described later turn out to be **inevitable**.

- A Transformer model has **large memory consumption per segment**, and `htdemucs` is limited to a **maximum of 7.8 seconds per segment**. So long audio is **automatically split into segments**, and memory pressure is determined by `segment`, `jobs`, and `shifts`.
- Carrying **both** waveform and spectrogram makes the computation heavy, so it's **slow on CPU** (GPU recommended; it runs from 3 GB of VRAM).
- The quality knob (`shifts`) **runs the prediction several times and averages the results**, so quality trades off directly against time.

In other words, the parameters are not knobs to turn on a whim; each one **follows directly from this design**.
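
As a rough mental model of how those knobs interact, the sketch below estimates how many chunks a file is cut into and how many forward passes result. It assumes a simple scheme in which chunks of `segment` seconds advance by `segment * (1 - overlap)` and every chunk is predicted `shifts` times. The function name is hypothetical, and the real implementation may differ in details such as padding.

```python
import math

def estimate_forward_passes(duration_s, segment_s=7.8, overlap=0.25, shifts=1):
    """Estimate chunk count and total model passes for one file (assumed scheme)."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    stride_s = segment_s * (1 - overlap)        # how far each chunk advances
    chunks = math.ceil(duration_s / stride_s)   # chunks needed to cover the file
    return chunks, chunks * max(1, shifts)

# A 4-minute track at the htdemucs segment limit, default overlap, shifts=2.
chunks, passes = estimate_forward_passes(240, segment_s=7.8, overlap=0.25, shifts=2)
```

Under these assumptions the 4-minute track becomes 42 chunks and 84 passes. Halving `segment` roughly doubles the chunk count while lowering peak memory per chunk, which is the trade the OOM recovery later in this article relies on.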

---

## How to Choose a Model: htdemucs, htdemucs_ft, htdemucs_6s, mdx

Demucs bundles multiple pretrained models, chosen with `-n` (CLI) or `model=` (API). The choice in practice is simple.

| Model name | Stems | Speed | Quality | Choose it thus |
| --- | --- | --- | --- | --- |
| **`htdemucs`** (default) | 4 (vocals/drums/bass/other) | Standard | High | **Just go with this first.** The v4 standard. Sufficient for many uses |
| **`htdemucs_ft`** | 4 | **about 4× slower** | Highest | Quality first. A configuration bundling models fine-tuned specially for each source (official: _"4 times more time but might be a bit better"_) |
| **`htdemucs_6s`** | 6 (+guitar/piano) | Standard | High (piano is weak) | When you also want to split guitar / piano. **But the official README notes that the piano source is "not working great at the moment"** |
| **`hdemucs_mmi`** | 4 | Fast-ish | Medium–high | A retrained version of v3 (Hybrid Demucs). When you want it light, without the Transformer |
| **`mdx` / `mdx_extra`** | 4 | Fast-ish | Medium–high | For the old MDX challenge. `mdx_extra` includes extra data |
| **`mdx_q` / `mdx_extra_q`** | 4 | Fast | Medium | **Quantized versions**. Smaller download and storage, at slightly lower quality |

**Shortcuts for the decision:**

- You want **the best standard for now** → **`htdemucs`** (default). No need to specify.
- You want **the last few % of quality** (deliverables, hero cuts) → **`htdemucs_ft`**. In exchange for 4× the time.
- You **also want to split guitar/piano** → **`htdemucs_6s`**. But don't over-expect the piano.
- **Memory/size is tight** (edge, large batches) → the quantized **`mdx_q` / `mdx_extra_q`**.

> 💡 The reason `htdemucs_ft` is "4× slower" is that it internally **runs multiple per-source fine-tuned models in sequence**. The return value of `demucs.api.list_models()` being split into `{"single": [...], "bag": [...]}` is for this reason, and `htdemucs_ft` belongs to the **bag (bundle of models)** side. Since the quality gain is "slight," the two-tier setup of **drafts in `htdemucs`, deliverables only in `htdemucs_ft`** is the cost-efficient usage.

---

## Usage A: First, Try It (CLI One-Liner)

Demucs's biggest strength is that it's **lightweight to install**. Unlike a heavy diffusion model like LatentSync, it works **even without a GPU** (just slowly). First, feel the quality on your own audio.

### 1. Install

```bash
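# Requires Python 3.8+. Installing inside a virtual environment is the safe option
python3 -m pip install -U demucs

# To handle formats other than wav (mp3, etc.), install ffmpeg too (effectively required on Windows)
#   macOS:  brew install ffmpeg
#   Ubuntu: sudo apt install ffmpeg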
```

### 2. Separate (One Command)

```bash
# Separate into 4 stems with the default model (htdemucs)
demucs song.mp3

# The output is written in this layout:
# separated/htdemucs/song/
#   ├── drums.wav
#   ├── bass.wav
#   ├── other.wav
#   └── vocals.wav
```

### 3. Frequently Used Patterns

```bash
# ① Karaoke / vocal removal: produce only two files, vocals and no_vocals
demucs --two-stems=vocals song.mp3
#   → separated/htdemucs/song/vocals.wav, no_vocals.wav

# ② Select the GPU explicitly (the default is automatic; use -d cpu to force CPU)
demucs -d cuda song.mp3

# ③ Highest-quality model + mp3 output (320 kbps)
demucs -n htdemucs_ft --mp3 --mp3-bitrate 320 song.mp3

# ④ Set the output directory + process several files at once
demucs -o ./out track1.wav track2.flac track3.mp3

# ⑤ 6 stems (including guitar and piano)
demucs -n htdemucs_6s song.mp3
```

`demucs --help` shows all flags. The main ones are as follows.

| Flag | Role | Default |
| --- | --- | --- |
| `-n MODEL` | Model selection | `htdemucs` |
| `--two-stems=vocals` | Narrow to 2 stems (the specified source / everything else) | off (4 stems) |
| `-d {cpu,cuda}` | Compute device | auto |
| `--mp3` / `--flac` | Output format (default is 16bit wav) | wav |
| `--mp3-bitrate` | mp3 bitrate (kbps) | `320` |
| `--mp3-preset` | mp3 quality (2=best to 7=fastest) | `2` |
| `--int24` / `--float32` | Save wav as 24bit int / 32bit float | 16bit |
| `--shifts N` | Average N times with random time shifts (quality↑, N× slower) | `1` |
| `--overlap` | Segment overlap ratio | `0.25` |
| `--segment SEC` | Segment length in seconds (htdemucs is **max 7.8**; the CLI expects an integer) | model default |
| `-j JOBS` | Number of parallel jobs (faster but memory↑) | `0` |
| `--clip-mode {rescale,clamp}` | Clipping handling | `rescale` |
| `-o DIR` | Output directory | `separated` |
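
When the CLI is driven from a script, it helps to assemble the argument list in one place and reject invalid combinations before a job starts. The helper below is a minimal sketch that uses only flags from the table above. The function name and the up-front check against the 7.8-second limit are my own additions, not part of Demucs.

```python
import shlex

HTDEMUCS_MAX_SEGMENT = 7.8  # limit quoted above for htdemucs-family models

def build_demucs_command(inputs, model="htdemucs", two_stems=None, shifts=1,
                         overlap=0.25, segment=None, out_dir="separated", device=None):
    """Assemble a `demucs` argument list from the flags in the table above."""
    if segment is not None and model.startswith("htdemucs") and segment > HTDEMUCS_MAX_SEGMENT:
        raise ValueError(f"segment={segment} exceeds the {HTDEMUCS_MAX_SEGMENT}s limit")
    cmd = ["demucs", "-n", model, "--shifts", str(shifts),
           "--overlap", str(overlap), "-o", out_dir]
    if two_stems:
        cmd.append(f"--two-stems={two_stems}")
    if segment is not None:
        cmd += ["--segment", str(segment)]
    if device:
        cmd += ["-d", device]
    return cmd + list(inputs)

cmd = build_demucs_command(["song.mp3"], two_stems="vocals", segment=7)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with file names that contain spaces.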

---

## Usage B: Embed via the Python API

To embed in your own server or batch, the **Python API (`demucs.api`) is better than shelling out to the CLI**. Because you can receive tensors directly, you can connect them straight to downstream stages (mixing, encoding, uploading).

### Minimal Setup

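Use the API of the official [`docs/api.md`](https://github.com/adefossez/demucs/blob/main/docs/api.md) as-is. Note that the `demucs.api` module may not be included in the release that `pip install -U demucs` pulls from PyPI; if `import demucs.api` fails, install Demucs from the GitHub repository as described in the README.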

```python
import demucs.api

# Create the Separator once so the model is loaded a single time (reuse it)
separator = demucs.api.Separator(model="htdemucs")

# Separate: returns a tuple of (original waveform, {stem name: tensor})
origin, stems = separator.separate_audio_file("song.mp3")
# stems = {"drums": Tensor, "bass": Tensor, "other": Tensor, "vocals": Tensor}

# Write out each stem
for name, source in stems.items():
    demucs.api.save_audio(source, f"{name}.wav", samplerate=separator.samplerate)
```

The `Separator` constructor arguments (faithful to the official docs):

| Argument | Meaning |
| --- | --- |
| `model="htdemucs"` | Model name |
| `repo=None` | Local model storage location (for offline operation) |
| `device` | `"cuda"` / `"cpu"` / `torch.device`. Defaults to CUDA when available, otherwise CPU |
| `shifts=1` | Number of random-shift predictions to average (up to +0.2 SDR, proportionally slower) |
| `overlap=0.25` | Segment overlap ratio |
| `split=True` | Whether to chunk long audio |
| `segment=None` | Segment length (seconds); `None` uses the model default. htdemucs is max 7.8 |
| `jobs=0` | Number of parallel jobs |
| `progress=False` | Show a progress bar |
| `callback=None` | A function called at chunk start/end (used for progress / observation) |

### Make Karaoke (vocals / no_vocals) via the API

To do the CLI's `--two-stems` equivalent via the API, **sum the non-vocals**.

```python
import demucs.api

separator = demucs.api.Separator(model="htdemucs")
origin, stems = separator.separate_audio_file("song.mp3")

# Summing every stem except vocals gives the accompaniment (no_vocals)
no_vocals = sum(src for name, src in stems.items() if name != "vocals")

demucs.api.save_audio(stems["vocals"], "vocals.wav", samplerate=separator.samplerate)
demucs.api.save_audio(no_vocals, "no_vocals.wav", samplerate=separator.samplerate)
```

### Observe Progress (callback)

For long audio, you want to show the user progress. You can pick up chunk progress with `callback`.

```python
def on_progress(data: dict) -> None:
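    # data carries keys such as the chunk offset (segment_offset) and the state ("start" / "end")
    # In production, send this to structured logs / metrics (never print PII to stdout)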
    if data.get("state") == "end":
        print(f"segment {data.get('segment_offset')} done")

separator = demucs.api.Separator(model="htdemucs", callback=on_progress, progress=True)
```
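
If you want a percentage rather than raw chunk events, a small wrapper can convert them. The sketch below assumes the callback dict carries `segment_offset` and `audio_length` in the same unit, along with the `state` key used above. Those key names are my reading of `docs/api.md`, so verify them against your installed version. `make_progress_callback` and `report` are hypothetical names.

```python
def make_progress_callback(report):
    """Return a callback that converts chunk events into a 0-100 percentage."""
    def on_progress(data: dict) -> None:
        length = data.get("audio_length") or 0
        if data.get("state") != "end" or length <= 0:
            return  # ignore "start" events and malformed payloads
        offset = data.get("segment_offset", 0)
        report(min(100.0, 100.0 * offset / length))
    return on_progress

# Stand-in events for illustration; in real use pass the callback to Separator(callback=...).
seen = []
callback = make_progress_callback(seen.append)
callback({"state": "start", "segment_offset": 0, "audio_length": 1000})
callback({"state": "end", "segment_offset": 500, "audio_length": 1000})
```

The offset marks where a chunk begins, so the last event reports slightly under 100%. With `shifts` above 1 or a bag of models the offsets restart for each pass, so a production progress bar would also need to account for the pass count.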

---

## Practical Parameter Tuning: Presets by Use

The official docs list the knobs and their defaults. Here I add **values that work well in practice**. The table below is a set of ready-to-use presets.

| Use | Model | `shifts` | `overlap` | `segment` | Output | Aim |
| --- | --- | --- | --- | --- | --- | --- |
| **Draft check** (internal review) | `htdemucs` | 1 | 0.25 | default | mp3 320 | See the shape fast and cheap |
| **Standard delivery** (distribution/PR) | `htdemucs` | 2 | 0.25 | default | wav | The best point of quality and speed |
| **Highest quality** (hero / mastering) | `htdemucs_ft` | 5 | 0.5 | default | float32 | Details first (time is several ×) |
| **Memory-saving** (small VRAM/CPU) | `mdx_q` | 1 | 0.1 | smaller | mp3 | Avoid OOM, prioritize finishing |
| **ASR preprocessing** (to Whisper) | `htdemucs` | 1 | 0.25 | default | wav (vocals only) | Speed first, need only the voice |

Rephrasing the meaning of the knobs in practical terms:

- **`shifts` (time-shift averaging)**: shifts the input slightly, **predicts N times, and averages the results**. Quality rises (official: **up to +0.2 dB SDR**), but **processing takes N× as long**. As a rule, use `2–5` for deliverables only and `1` for drafts.
- **`overlap` (segment overlap)**: default **0.25**. Raising it reduces artifacts at segment seams but makes processing **slower**. Set it to `0.5` only for material where the seams are audible.
- **`segment` (length of one chunk)**: the **main knob for memory and quality**. `htdemucs` allows **at most 7.8 seconds**. **Lower it if OOM occurs.** Lowering it too far adds seams, so start from the default and shrink gradually only on failure.
- **`-j / jobs` (parallelism)**: runs work in parallel and speeds things up when multiple cores are available, but **memory use grows with the number of jobs**. Size it against your VRAM / RAM.
- **`--mp3-preset`**: from `2` (best quality, slowest) to `7` (fastest, lowest quality). Use 2–3 for delivery and the faster side for review copies.

> **A principle that maps directly to cost**: quality costs time, and therefore money, **roughly in proportion to `shifts`**. Running every track at high `shifts` is wasteful. A two-tier approach, "**check drafts at `shifts=1` → rerun only the adopted ones with `htdemucs_ft` + `shifts=2–5`**," keeps the per-track production cost lowest.
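
The arithmetic behind that principle is easy to check. The sketch below uses a rule-of-thumb cost model in which processing time scales with `shifts` and `htdemucs_ft` costs about 4× `htdemucs`, as quoted earlier. The function names and the track counts are hypothetical, and real timings depend on hardware.

```python
MODEL_COST_FACTOR = {"htdemucs": 1.0, "htdemucs_ft": 4.0}  # "about 4x slower" from the model table

def relative_cost(model="htdemucs", shifts=1):
    """Processing time relative to htdemucs with shifts=1 (rule-of-thumb model)."""
    return MODEL_COST_FACTOR[model] * max(1, shifts)

def two_tier_cost(total_tracks, adopted_tracks, final_model="htdemucs_ft", final_shifts=2):
    """Cost of drafting every track cheaply, then re-running only the adopted ones."""
    draft = total_tracks * relative_cost("htdemucs", 1)
    final = adopted_tracks * relative_cost(final_model, final_shifts)
    return draft + final

everything_at_top_quality = 100 * relative_cost("htdemucs_ft", 2)  # 800.0 units
two_tier = two_tier_cost(total_tracks=100, adopted_tracks=10)       # 180.0 units
```

With 100 candidate tracks of which 10 are adopted, the two-tier flow costs 180 units against 800 for running everything at top quality, under these assumed multipliers.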

---

## Use-Case Recipes (Code You Can Use As-Is)

### A. ASR (Whisper) Preprocessing: Raise Transcription Accuracy Under BGM

Audio carrying music or noise is often misrecognized by ASR if passed as-is. **Extracting only the human voice with Demucs and then passing it to Whisper** raises accuracy.

```bash
# 1) Extract the voice only (drop the accompaniment and noise)
demucs --two-stems=vocals -o sep podcast_with_bgm.wav

# 2) Pass the clean voice to Whisper (self-hosted or API; see the separate article)
whisper sep/htdemucs/podcast_with_bgm/vocals.wav --language ja
```

The detailed production design on the Whisper side (self-host vs API, cost, timestamps) is in the [Whisper article](/blog/openai-whisper-production-guide-selfhost-vs-api). **"Extract the voice with Demucs → transcribe with Whisper" is the standard preprocessing that raises the quality of video localization and meeting minutes.**

### B. Video Localization: Keep the BGM and Dub Into Another Language

When you **replace the audio itself** rather than adding subtitles, the result sounds cheap if the BGM and sound effects vanish too. The crux is to **peel off only the narration with Demucs and keep the BGM (no_vocals)**.

```bash
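# 1) Extract the audio from the video (44.1 kHz)
ffmpeg -i source.mp4 -vn -ar 44100 audio.wav

# 2) Separate into narration (vocals) and BGM / sound effects (no_vocals)
demucs --two-stems=vocals -o sep audio.wav

# 3) sep/htdemucs/audio/no_vocals.wav is the "BGM and sound effects only" track.
#    Mix the new-language narration into it to dub while preserving the original atmosphere.
#    (amix scales its inputs down by default; on ffmpeg 4.4+ you can append :normalize=0 if levels drop.)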
ffmpeg -i sep/htdemucs/audio/no_vocals.wav -i narration_en.wav \
  -filter_complex amix=inputs=2:duration=longest dubbed_audio.wav
```

This is exactly the **first stage** of my AI video-localization platform. How cleanly you keep the BGM here determines how **authentic** the final dubbed video feels.

### C. Karaoke / Off-Vocal for Sing-Along Covers

```bash
demucs --two-stems=vocals --mp3 --mp3-bitrate 320 song.mp3
# → no_vocals.mp3 can be used directly as the karaoke track
```

### D. Transcribing by Ear: Extract Only Bass/Drums

```bash
demucs song.mp3
# Load only separated/htdemucs/song/bass.wav into your DAW to follow the bass line easily
```

---

## 5 Pitfalls That Always Bite in Production, and Resilience Design

These are problems that don't happen in a demo (a 4-minute wav, one file) but **erupt all at once in reality (tens of minutes, large volume, diverse formats)**. As seen in the mechanism chapter, they **follow directly from Demucs's design**.

### ① Memory Exhaustion (OOM) at Long Audio / Parallelism

The Transformer has large memory consumption per interval. Stack long audio, increased `shifts`, and `-j` parallelism, and it falls over with **out of memory** on GPU or CPU (RAM) alike.

**Countermeasure**: because Demucs **auto-segments** long audio, the effective knobs are to **lower `segment`, reduce `jobs`, and dial back `shifts`**. In low-VRAM environments, the environment variable **`PYTORCH_NO_CUDA_MEMORY_CACHING=1`** also helps. And treat **OOM not as an exception but as a normal case**, **degrading gracefully in stages**.

```python
import torch
import demucs.api

def separate_resilient(path: str, model: str = "htdemucs"):
    """Treat OOM as a normal case: shrink segment first, then fall back to CPU."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    segment: float | None = None  # first try with the model default

    for attempt in range(4):
        try:
            sep = demucs.api.Separator(model=model, device=device, segment=segment)
            return sep.separate_audio_file(path)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            if segment is None:
                segment = 7.0          # start just below the htdemucs limit (7.8)
            elif segment > 4:
                segment = segment / 2   # shrink the segment first
            elif device == "cuda":
                device, segment = "cpu", None  # last resort: fall back to CPU (slow, but it finishes)
            else:
                raise                   # failing even on CPU is a genuine fault; do not swallow it
    raise RuntimeError("separation failed after retries")
```

### ② The CPU Fallback's "Works, but Painfully Slow"

If there's no GPU or a driver mismatch drops you to CPU, it takes **about 1.5× the length of the source**. About 6 minutes for a 4-minute song, about 1.5 hours for 1 hour of audio. It tends to become a "working but never finishing" state.

**Countermeasure**: always attach **a timeout and progress logs** to CPU execution. Route batches to the GPU, and treat the CPU as the fallback of last resort. Check `torch.cuda.is_available()` at startup, and **surface unintended CPU execution as a warning**.

```python
import torch, logging

logger = logging.getLogger("demucs")
if not torch.cuda.is_available():
    logger.warning("CUDA unavailable: running on CPU (~1.5x realtime). Batch jobs will be slow.")
```
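
For the timeout itself, one option is to derive the budget from the audio length instead of hard-coding a number. The sketch below assumes the 1.5× rule of thumb above, scales it by `shifts`, and applies a safety margin. The function names, the 2× margin, and the 60-second floor are placeholder assumptions to tune for your workload.

```python
import subprocess

CPU_REALTIME_FACTOR = 1.5  # rule of thumb quoted above for CPU processing

def cpu_timeout_seconds(duration_s, shifts=1, safety=2.0, floor_s=60.0):
    """Timeout budget: expected CPU time times a safety margin, never below floor_s."""
    return max(floor_s, duration_s * CPU_REALTIME_FACTOR * max(1, shifts) * safety)

def run_with_timeout(cmd, duration_s, shifts=1):
    """Run a separation command, raising subprocess.TimeoutExpired if it overruns."""
    return subprocess.run(cmd, check=True, timeout=cpu_timeout_seconds(duration_s, shifts))

budget = cpu_timeout_seconds(3600)  # 1 hour of audio -> 10800.0 s (3 hours)
```

A job that exceeds the budget then fails loudly with `subprocess.TimeoutExpired`, which is easier to alert on than a worker that silently never finishes.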

### ③ ffmpeg Dependence / Format Incompatibility

`torchaudio` can read wav / mp3 / flac / ogg and similar formats. **On Windows and some other environments, ffmpeg is mandatory for reading and writing mp3**, and without it you get a confusing failure where wav works but mp3 does not.

**Countermeasure**: adding one stage that **normalizes input to 44.1 kHz wav with ffmpeg** before separation greatly improves reproducibility and compatibility.

```bash
# Normalize before feeding in: convert any input to 44.1 kHz stereo wav
ffmpeg -y -i input.any -ar 44100 -ac 2 normalized.wav
```
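
Inside a worker, the same normalization can be wrapped so that a missing ffmpeg binary is reported clearly instead of surfacing later as a decoding error. This is a minimal sketch with hypothetical function names and an arbitrary output naming scheme; it mirrors the ffmpeg options in the command above.

```python
import shutil
import subprocess
from pathlib import Path

def build_normalize_command(src, dst, sample_rate=44100, channels=2):
    """ffmpeg arguments that convert any input to a wav at the given rate and channel count."""
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate),
            "-ac", str(channels), str(dst)]

def normalize_audio(src, work_dir):
    """Normalize `src` into work_dir/<stem>.normalized.wav and return the new path."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH; install it before accepting uploads")
    dst = Path(work_dir) / (Path(src).stem + ".normalized.wav")
    subprocess.run(build_normalize_command(src, dst), check=True, capture_output=True)
    return dst

cmd = build_normalize_command("input.any", "normalized.wav")
```

Checking `shutil.which("ffmpeg")` once at service startup as well turns a per-file mystery into a deployment error that is caught before any job runs.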

### ④ Double-Processing (Separating the Same Audio Over and Over, Melting GPU Money)

With user resends, retries, and rapid clicks, the same audio gets **separated over and over**, which wastes GPU time directly. Separation is heavy processing, so mistakes here show up immediately in cost.

**Countermeasure**: build a deterministic job key from **the input's contents (the file's sha256) + the model + the parameters**, and **cache the result**. For the same input, return the existing result without computing (idempotency).

```python
import hashlib, json
from dataclasses import dataclass, asdict
from pathlib import Path

# Note: slots=True and the `str | None` annotation require Python 3.10+.
@dataclass(frozen=True, slots=True)
class SepParams:
    model: str = "htdemucs"
    two_stems: str | None = None
    shifts: int = 1
    overlap: float = 0.25

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time (constant memory even for large files)
            h.update(chunk)
    return h.hexdigest()

def job_key(audio: Path, params: SepParams) -> str:
    payload = json.dumps({"audio": file_sha256(audio), **asdict(params)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def separate_idempotent(audio: Path, out_root: Path, params: SepParams) -> Path:
    dest = out_root / job_key(audio, params)
    # A completion marker works for any stem layout (e.g. two_stems="drums")
    # and never counts half-written output as a cache hit.
    done = dest / "DONE"
    if done.exists():                     # idempotency: same input × same settings is never recomputed
        return dest                       # resends and retries do not consume the GPU again
    dest.mkdir(parents=True, exist_ok=True)
    # ... call separate_resilient here and save the stems under dest ...
    done.touch()                          # mark complete only after every stem is saved
    return dest
```
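
One way to make sure a cache hit never sees half-written output is to build the result in a temporary directory and rename it into place only when it is complete. The sketch below shows that pattern with the standard library. `publish_atomically` and `write_stems` are hypothetical names, and it assumes the temporary directory and the destination live on the same filesystem so the rename is a single step.

```python
import os
import shutil
import tempfile
from pathlib import Path

def publish_atomically(out_root: Path, key: str, write_stems) -> Path:
    """Build results in a temp directory, then rename it into place in one step."""
    dest = out_root / key
    if dest.exists():
        return dest                       # already published by an earlier run
    out_root.mkdir(parents=True, exist_ok=True)
    tmp = Path(tempfile.mkdtemp(prefix=f".{key}.", dir=out_root))
    try:
        write_stems(tmp)                  # the expensive separation writes here
        try:
            os.replace(tmp, dest)         # rename within the same filesystem
        except OSError:
            if not dest.exists():
                raise                     # a genuine failure, not a lost race
            # Another worker published first; keep its result.
    finally:
        shutil.rmtree(tmp, ignore_errors=True)  # no-op after a successful rename
    return dest
```

A failed or interrupted job leaves nothing behind under the job key, so the next request simply recomputes instead of serving a broken result.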

### ⑤ Checking Quality Only by "Ear"

In large batches, some **vocal residue** and **bleed between stems** will inevitably go unnoticed if you rely on listening alone.

**Countermeasure**: after separation, **automatically check simple statistics such as each stem's RMS level and the correlation between stems**, and send only results outside your thresholds to human review. Even a simple metric that asks "does no_vocals still contain too much of the human voice?" drastically cuts review effort. The metric does not need to be perfect; the aim is to **catch outliers by machine**.
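
A minimal sketch of such a check, in pure Python on mono sample lists, might look like the following. The thresholds (`max_corr`, `min_rms`) are placeholders that I have not tuned against real data, and the idea of flagging on the correlation between the two stems is a rough heuristic rather than an established metric. In practice you would compute this with numpy on downmixed stems.

```python
import math

def rms(samples):
    """Root-mean-square level of a mono signal (0.0 for an empty one)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0

def correlation(a, b):
    """Pearson correlation over the common length (0.0 if either signal is constant)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0

def needs_review(vocals, no_vocals, max_corr=0.5, min_rms=1e-4):
    """Flag a result when a stem is near-silent or the two stems look too similar."""
    if rms(vocals) < min_rms or rms(no_vocals) < min_rms:
        return True
    return abs(correlation(vocals, no_vocals)) > max_corr
```

Run it per job and log the raw RMS and correlation values alongside the flag, so the thresholds can be adjusted later from real distributions instead of guesses.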

---

## Production-Operation Design Principles (Observability, Idempotency, Resilience, Cost)

Let me re-summarize the pitfall countermeasures in the language of operational quality. This is the difference between "works" and "doesn't fall over in production."

- **Idempotency**: make `sha256(audio contents + model + params)` the **job key** and cache the result. Don't double-separate on resend, rapid clicks, or retries. The heavier the processing, the more it helps.
- **Resilience**: treat OOM, CPU fallback, and format mismatch **not as exceptions but as normal cases**. With "shrink `segment` → fall back to CPU" and "ffmpeg-normalize before feeding in," prevent one file's failure from dragging the whole down.
- **Observability**: leave in structured logs, per song, **which model, what shifts/overlap, how many seconds it took, and whether the device was GPU or CPU**. Make it a state where you can later trace "why is only this song slow / poor quality." Don't emit the audio content itself (PII) to logs.
- **Cost efficiency**: cut the unit cost in 3 tiers: ① zero re-separation via the idempotency cache, ② drafts in `htdemucs`+`shifts=1`, deliverables only in `htdemucs_ft`, ③ fill up GPU batches before running them. Because Demucs is **lightweight**, self-hosting reaches a low unit cost easily.
- **API vs self-hosting**: because Demucs is **MIT and runs from 3 GB of VRAM**, **self-hosting is the baseline**. Hosting like Replicate (e.g. [`cjwbw/demucs`](https://replicate.com/cjwbw/demucs)) is handy for "trying just a few without building an environment," but **a steady large volume is usually far cheaper on your own GPU batch**. The sound approach is to verify quality with the local CLI first, then turn it into a GPU worker once volume grows (prices and specs vary, so check the primary source at the time of use).
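
As an illustration of the observability point, the sketch below emits one JSON line per job with the settings, the device, and the measured speed, and deliberately leaves out file names and audio content. The field names and the function name are my own choices, not a standard schema.

```python
import json

def separation_log_record(job_key, model, shifts, overlap, device,
                          audio_seconds, elapsed_seconds):
    """One JSON line per job: settings, device, and speed, but no audio content or file names."""
    return json.dumps({
        "event": "separation_done",
        "job_key": job_key,
        "model": model,
        "shifts": shifts,
        "overlap": overlap,
        "device": device,
        "audio_seconds": round(audio_seconds, 2),
        "elapsed_seconds": round(elapsed_seconds, 2),
        # Above 1 means processing took longer than the audio's own duration.
        "realtime_factor": round(elapsed_seconds / audio_seconds, 3) if audio_seconds else None,
    }, sort_keys=True)

# Hypothetical values: a 4-minute track that took 6 minutes on CPU.
line = separation_log_record("3f2a", "htdemucs", 1, 0.25, "cpu", 240.0, 360.0)
```

Filtering these records for `device == "cpu"` or an unusually high `realtime_factor` answers the "why is only this song slow" question without replaying the job.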

---

## Comparison with Other Source-Separation Tools

This section answers "so which one do I use, after all?" The right answer is to **choose by use**; no tool is all-purpose.

| Tool | Method | Quality (rough) | Speed / lightness | License | Suited scene |
| --- | --- | --- | --- | --- | --- |
| **Demucs v4 (HT Demucs)** | Hybrid (waveform+spectral) + Transformer | **SOTA-class among public models** (9.0–9.20 dB) | Fast on GPU, **CPU-OK**, from 3GB VRAM | **MIT** | Quality-first, production separation in general |
| **Spleeter (Deezer)** | CNN (spectrogram) | A notch below | **Very fast** | MIT | Speed-first, large draft volumes |
| **Open-Unmix (UMX)** | BiLSTM | Modest (reference implementation) | Lightweight | MIT | Research, baseline comparison |
| **MDX-Net family** (bundled in UVR, etc.) | Spectrogram | High (strong in ensembles) | Depends on the model | Various | **Combine with Demucs** for an ensemble |

In practice, the decision goes roughly as follows.

- **Quality is paramount, and you want to clear commercial use under MIT** → **Demucs v4**. First choice.
- **You just want to pre-process a large volume fast** → two-tier of drafts with Spleeter → real processing of the adopted ones with Demucs.
- **You want to squeeze out the last few %** → **ensemble** Demucs (`htdemucs_ft`) with the MDX family (GUIs like UVR implement this).

> Each tool's license/quality gets updated. **Always confirm commercial use with the primary source.** Demucs is **MIT** (official repository) at the time of writing.

---

## Frequently Asked Questions (FAQ)

**Q. Can I use it commercially?**
A. Demucs's code and models are under the **MIT license**, supporting commercial use. But **the copyright / master rights of the separated audio itself are a separate matter**. Acts like separating a commercial song and redistributing it secondarily require the rights holder's permission. Rights processing is the user's responsibility.

**Q. Is a GPU mandatory?**
A. No. **It runs on CPU too** (slowly: processing takes about 1.5× the audio's duration). On a GPU it works from **3 GB of VRAM**, and **7 GB** is a rough guide for comfortable operation at default settings. GPU is recommended for large batch processing.

**Q. Can it do real time?**
A. It's basically **offline processing**. The design separates one whole file at a time, and on CPU processing takes about 1.5× the audio's duration. It is unsuited to strict low-latency live use.

**Q. Which model should I choose?**
A. Start with the default **`htdemucs`**; it is sufficient for most cases. Use **`htdemucs_ft`** (about 4× slower) only for tracks that need the highest quality. If you also want to split out guitar and piano, use **`htdemucs_6s`**, but keep expectations modest for the piano stem.
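
To turn the speed figures in these answers into a planning number, the sketch below multiplies the track length by the roughly 1.5× CPU factor and, for `htdemucs_ft`, by the roughly 4× slowdown. Stacking the two factors is an assumption rather than a documented benchmark, and the names `estimate_cpu_seconds` and `MODEL_SLOWDOWN` are hypothetical.

```python
CPU_FACTOR = 1.5  # approx. processing time / track duration on CPU
MODEL_SLOWDOWN = {
    "htdemucs": 1.0,     # baseline
    "htdemucs_ft": 4.0,  # roughly 4x slower than htdemucs
}


def estimate_cpu_seconds(track_seconds, model="htdemucs"):
    """Rough CPU processing-time estimate in seconds (planning aid only)."""
    if track_seconds < 0:
        raise ValueError("track_seconds must be non-negative")
    if model not in MODEL_SLOWDOWN:
        # No speed figure is quoted for other models, so refuse to guess.
        raise ValueError(f"no speed figure available for model: {model}")
    return track_seconds * CPU_FACTOR * MODEL_SLOWDOWN[model]


four_minute_song = 240
default_estimate = estimate_cpu_seconds(four_minute_song)
fine_tuned_estimate = estimate_cpu_seconds(four_minute_song, "htdemucs_ft")
```

Under these assumptions a 4-minute song takes about 6 minutes with `htdemucs` and about 24 minutes with `htdemucs_ft` on CPU; measure on your own hardware before committing to a capacity plan.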

**Q. I run out of memory on songs over 5 minutes.**
A. Demucs **splits long audio into segments automatically**. If you still hit OOM, **lower `segment`, reduce `jobs`, and dial back `shifts`**; on low-VRAM GPUs, also set the environment variable **`PYTORCH_NO_CUDA_MEMORY_CACHING=1`**. If that is still not enough, **fall back to CPU** (see the [resilience code](#5-pitfalls-that-always-bite-in-production-and-resilience-design)).
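
The order of those steps can be sketched as a small retry ladder: try the GPU with progressively shorter segments, then the CPU as a last resort. This is a minimal sketch, not the article's production code: `separate(path, device, segment)` is a hypothetical callable you supply, the segment values are illustrative, and `oom_errors` should be set to whatever exception type your runtime actually raises on OOM.

```python
def separate_with_fallback(separate, path, segments=(7, 5, 3),
                           oom_errors=(MemoryError,)):
    """Try GPU with shrinking segment lengths, then CPU as the last resort.

    `separate` is a hypothetical caller-supplied function that performs the
    separation and raises one of `oom_errors` when memory runs out.
    Returns (result, device, segment) for the first attempt that succeeds.
    """
    attempts = [("cuda", s) for s in segments]
    attempts.append(("cpu", segments[-1]))  # final, slow but safe attempt
    failures = []
    for device, segment in attempts:
        try:
            return separate(path, device, segment), device, segment
        except oom_errors as exc:
            # Record the failure so the final error explains what was tried.
            failures.append((device, segment, str(exc)))
    raise RuntimeError(f"all attempts ran out of memory: {failures}")


# Stand-in for a real separation call: pretend only short segments fit.
def fake_separate(path, device, segment):
    if device == "cuda" and segment > 3:
        raise MemoryError("simulated OOM")
    return f"stems for {path}"


result, used_device, used_segment = separate_with_fallback(
    fake_separate, "long_track.wav"
)
```

Logging `used_device` and `used_segment` for every job makes it visible how often the slow CPU path is taken.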

**Q. How do I output to mp3?**
A. Add `--mp3` (with `--mp3-bitrate 320` to set the bitrate). The default output is **16-bit wav**. For higher bit depth use `--int24` or `--float32`; for FLAC use `--flac`.
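
If you drive the CLI from a script, the flags above can be kept in one lookup so the output format becomes a single parameter. A minimal sketch: the helper `output_format_args` and its format labels (`wav16`, `wav24`, and so on) are hypothetical names, and only the flags themselves come from the answer above.

```python
def output_format_args(fmt="wav16", mp3_bitrate=320):
    """Map a format label (hypothetical naming) to demucs CLI flags."""
    flags = {
        "wav16": [],               # default output, no flag needed
        "wav24": ["--int24"],
        "wav32f": ["--float32"],
        "flac": ["--flac"],
        "mp3": ["--mp3", "--mp3-bitrate", str(mp3_bitrate)],
    }
    if fmt not in flags:
        raise ValueError(f"unknown format label: {fmt}")
    return flags[fmt]


# Build the argument list; run it with subprocess.run(cmd, check=True).
cmd = ["demucs", *output_format_args("mp3"), "song.mp3"]
```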

**Q. Can I use it not just for songs but for narration / meeting audio?**
A. Yes. `vocals` captures the "**human voice**," so it is also effective on non-musical material for **separating narration or speech from BGM and noise**. It is genuinely useful in video localization and ASR preprocessing.

**Q. Can I use it for Whisper preprocessing?**
A. Yes, and it is effective. **Isolate the voice with `--two-stems=vocals` before passing it to ASR**, and transcription accuracy on audio with BGM or noise tends to improve. Read it alongside the [Whisper article](/blog/openai-whisper-production-guide-selfhost-vs-api).
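
A minimal sketch of that preprocessing step: build the `--two-stems=vocals` command and compute where the isolated voice should land, so the next stage knows which file to transcribe. The helper name `vocals_command` is hypothetical, and the output path assumes Demucs's default `<out>/<model>/<track>/vocals.wav` layout; verify it against your installed version.

```python
from pathlib import Path


def vocals_command(audio_path, model="htdemucs", out_dir="separated"):
    """Return (argv, expected vocals path) for a vocals-only separation."""
    audio_path = Path(audio_path)
    argv = [
        "demucs",
        "--two-stems=vocals",  # produces vocals and no_vocals only
        "-n", model,
        "-o", str(out_dir),
        str(audio_path),
    ]
    # Assumed default layout: <out_dir>/<model>/<track name>/vocals.wav
    vocals_path = Path(out_dir) / model / audio_path.stem / "vocals.wav"
    return argv, vocals_path


argv, vocals_path = vocals_command("meeting.mp3")
```

Run `argv` with `subprocess.run(argv, check=True)`, confirm that `vocals_path` exists, and only then hand it to the ASR step.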

**Q. Which is cheaper, self-hosting or an API?**
A. Demucs is lightweight and MIT-licensed, so **self-hosting is the baseline**. For trying a handful of files, a hosted service (Replicate, etc.) is handy, but **for a steady large volume, self-operated GPU batches tend to win on unit cost**. The standard approach is to check quality locally first, then turn it into a worker once volume grows.

---

## Summary: Moving Demucs from "Works" to "Earns in Production"

The essence of Demucs v4 (HT Demucs) lies in **combining the waveform view (time) with the spectrogram view (frequency), and using a Transformer to relate long-range context across the two**. That is why it delivers SOTA-class quality among public models, and also why you need to understand the **knobs that follow from this design**: `segment`, `shifts`, and `overlap`.

The path to implementation is simple.

1. First check the quality of your own audio with the **CLI (`pip install -U demucs` → `demucs song.mp3`)** (no GPU needed, runs on CPU too).
2. If the quality holds up, embed it in your own pipeline with the **Python API (`demucs.api.Separator`)** and refine it with the per-use presets.
3. **For production, build idempotency, resilience, observability, and cost control** into the design: a sha256 idempotency key, staged degradation on OOM, ffmpeg normalization, and structured logging.
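
As a compact reminder of the idempotency idea in step 3, the sketch below derives a sha256 key from the audio bytes plus the model name and separation parameters, so re-submitting the same job maps to the same key. The function name `idempotency_key` and the choice of fields are assumptions for illustration.

```python
import hashlib
import json


def idempotency_key(audio_bytes, model, params):
    """sha256 over input audio, model name, and parameters (order-independent)."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    digest.update(model.encode("utf-8"))
    digest.update(b"\0")
    # sort_keys makes the key independent of dict insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


audio = b"fake audio bytes for illustration"
key_a = idempotency_key(audio, "htdemucs", {"shifts": 1, "overlap": 0.25})
key_b = idempotency_key(audio, "htdemucs", {"overlap": 0.25, "shifts": 1})
key_c = idempotency_key(audio, "htdemucs_ft", {"shifts": 1, "overlap": 0.25})
```

Here `key_a` and `key_b` match because only the parameter order differs, while `key_c` differs because the model changed; a worker can skip any job whose key already has a stored result.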

Only at this point does it stop being a demo and become a product that "**doesn't fall over on the customer's material**." And this is the point I most want to convey: **this chain of design decisions is exactly where the choice of contractor makes a difference.** Anyone can build a demo that just calls `demucs`, but for **a foundation that doesn't break under the reality of long audio, diverse formats, and large batches**, quality comes directly from the number of pitfalls the builder has already hit.

> I implemented the pitfall handling and resilience design described in this article in the first stage (audio separation) of an **AI video-localization platform that is actually running in production**. If you're considering building or improving an audio/video AI pipeline that includes source separation, see my [portfolio](/case-studies/ai-video-localization-lipsync) and feel free to get in touch. Working as **one person × generative AI**, I take projects from PoC through to production operation: fast, cheap, and safe.

---

## Sources / Official Resources

- **Official repository (GitHub)**: [adefossez/demucs](https://github.com/adefossez/demucs) — README, CLI flags, model list, hardware requirements
- **Python API documentation**: [docs/api.md](https://github.com/adefossez/demucs/blob/main/docs/api.md) — `demucs.api.Separator` / `save_audio` / `list_models`
- **Paper (arXiv)**: [Hybrid Transformers for Music Source Separation (2211.08553)](https://arxiv.org/abs/2211.08553) — Rouard, Massa, Défossez (ICASSP 2023). Architecture, SDR, sparse attention
- **Hosting API (Replicate)**: [cjwbw/demucs](https://replicate.com/cjwbw/demucs) — input schema and pricing for trying without building an environment

> Versions, parameters, pricing, and licenses change over time. **Always confirm the primary source before implementing.** This article's numbers (SDR 9.0 / 9.20 dB, maximum segment of 7.8 seconds, 3GB/7GB VRAM, up to +0.2 dB from shifts, CPU processing at about 1.5× the track's duration, etc.) are based on the official information at the time of writing.
