# Building TTS/ASR training data with source separation: a preprocessing pipeline for clean speech datasets

> How to mass-produce TTS/ASR training data by cleaning it with source separation (UVR5/Demucs). Using working code, it walks through the pipeline of BGM/noise removal → resample → VAD splitting → quality gate → manifest generation, and covers when separation helps and when it backfires, quality checks based on residual energy, idempotency and cost, and consent and license governance for audio data, all with production data-foundation design in mind.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: audio source separation, TTS, ASR, datasets, audio preprocessing, Python, MLOps, AI audio
- URL: https://tomodahinata.com/en/blog/source-separation-tts-asr-training-data-preprocessing
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- The quality of the training data determines the quality of TTS/ASR. Training on audio that contains BGM or noise lowers quality, so use source separation to extract only the voice and build a clean corpus.
- The pipeline is "separate (extract voice) → resample (ASR=16k/TTS=22-24k) → drop silence/non-speech with VAD → quality gate → dedup → manifest generation." Make each stage idempotent and reproducible.
- The purpose differs from improving transcription accuracy at inference time (covered in a separate article); this article is about building a training dataset. Separation artifacts can harm TTS naturalness, and separation is unnecessary for clean recordings.
- The quality gate filters automatically on residual music energy, silence ratio, length, and (on a held-out set with ground truth) SI-SDR, so obviously bad clips are rejected by machine.
- For audio data, consent and licensing matter most. For TTS and voice-cloning use in particular, build the speaker's explicit consent and PII management into the design.

---

## The goal of this article

When training or fine-tuning a TTS (speech synthesis) or ASR (speech recognition) model, the biggest factor that decides quality is **not the model or the hyperparameters, but the quality of the training data.** Train on audio mixed with BGM, laughter, or environmental sound, and the model's output inherits it.

This is where **preprocessing by source separation** helps: **extract only the voice from songs or videos and mass-produce a clean speech corpus.** This article presents that **dataset-building pipeline** with its design and working code. By the end, you should be able to do the following.

1. Design a preprocessing foundation of **collected audio → separation → resample → VAD splitting → quality gate → manifest.**
2. Judge **when separation works and when it backfires**, and avoid wasted processing.
3. Weave **the consent/license of audio data** into the data-foundation design from the start.

> 📐 **This article's position**: preprocessing to "**transcribe the audio at hand more accurately**" is handled in the [Whisper accuracy-improvement article](/blog/source-separation-asr-preprocessing-whisper-accuracy). This article has a different purpose: **building a clean dataset to train a model.** Inference preprocessing and dataset building look similar but are not the same, and this article covers the data-foundation side.

> **About the author**: I **single-handedly designed and implemented, and now operate in production, an AI audio/video foundation** whose first stage is audio separation. Training-data preprocessing is the plain but most effective step for lifting model performance, and this article shares that implementation know-how. For choosing a separation model, see the [selection guide](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter).

---

## 30-second summary (conclusion first)

| Point | Conclusion |
| --- | --- |
| **Why separate** | Training on audio with BGM/noise degrades quality. **Extract only the voice** to clean it |
| **Pipeline** | separate (voice) → resample → VAD splitting → quality gate → dedup → manifest |
| **Resample** | **ASR=16kHz mono** and **TTS=22.05/24kHz** are common (match the target model) |
| **Quality gate** | auto-select by residual music energy, silence ratio, length, and (if ground truth exists) SI-SDR |
| **When to use separation** | effective for material with BGM/noise. **Unnecessary for clean studio recordings** (artifact-injection risk) |
| **Model** | for clean-priority, a high-quality model ([RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)); for volume-priority, [MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) |
| **Governance** | **consent, license, PII** are most important. Especially TTS/voice-cloning requires speaker consent |

---

## Why the quality of training data decides everything

The iron rule of machine learning, "**garbage in, garbage out**," is especially pronounced with audio.

- **TTS**: if the training audio is mixed with BGM, the synthesized voice also shows **a bleed of the background and an unnatural texture.** You want to train cleanly on just the speaker's voice.
- **ASR**: training on data mixed with noise/music **raises robustness to noise, but can affect accuracy on clean audio and the stability of transcription.** Mix clean/noisy by design according to the use.

When **building a speech corpus from "voice + BGM" material** like YouTube, podcasts, or movies, source separation works as **preprocessing that pulls out only the voice.** Conversely, for **clean studio recordings from the start**, separation is unnecessary, and applying it can actually lower quality due to separation artifacts (discussed later).

---

## The whole pipeline

```text
collected audio (voice+BGM)
   │
   ├─① source separation → extract only the vocal (voice)   [UVR5 / RoFormer / Demucs]
   │
   ├─② resample / make mono                                 [ASR=16k / TTS=22.05k or 24k]
   │
   ├─③ drop silence/non-speech with VAD, split into utterances  [Silero VAD, etc.]
   │
   ├─④ quality gate (machine-reject bad ones)               [residual music / silence ratio / length / SI-SDR]
   │
   ├─⑤ deduplication / near-dup removal                     [hash / acoustic fingerprint]
   │
   └─⑥ manifest generation (audio↔meta mapping)             [JSONL / CSV]
```

Make each stage **idempotent and reproducible** (same output for the same input), so that if the pipeline crashes midway, finished stages are skipped and processing can resume.

---

## Implementation: separate → resample → VAD → quality gate → manifest

### ① Extract the voice (batch)

```python
# extract_vocals.py: extract only the voice from collected audio (load the model once and reuse it)
from pathlib import Path
from audio_separator.separator import Separator

sep = Separator(output_dir="stage1_vocals", output_format="flac",
                output_single_stem="Vocals")          # write out only the vocal stem
sep.load_model(model_filename="Kim_Vocal_2.onnx")     # MDX-family models are fast for bulk runs

def extract(src: Path) -> str:
    return sep.separate(str(src))[0]                   # path of the extracted vocal stem
```

If cleanness is the top priority, switch the model to [BS-RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide) (heavy but high quality). For high-volume processing, favor speed with the MDX family. Decide the **quality-vs-volume trade-off** by use case.

### ② Resample & make mono

**Match the sample rate** to the model being trained. Commonly, **ASR is 16kHz mono**, **TTS is 22.05kHz or 24kHz.**

```python
# normalize.py: resample and downmix to mono for the target use
import subprocess

def to_target(src: str, dst: str, sr: int) -> None:
    # Use ffmpeg for reliability. -ac 1 = mono, -ar = sample rate
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sr), dst],
                   check=True)

# For ASR: to_target(v, "asr/clip.wav", 16000)
# For TTS: to_target(v, "tts/clip.wav", 24000)
```

### ③ Split into utterances with VAD

Cut long audio into **utterance (speech-segment) units** and drop silence/non-speech. This avoids mixing "silence-filled clips" into training.

```python
# segment.py: split into speech segments with VAD (example: Silero VAD)
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils       # helper functions returned with the model

def speech_segments(wav_path: str, sr: int = 16000):
    wav = read_audio(wav_path, sampling_rate=sr)
    ts = get_speech_timestamps(wav, model, sampling_rate=sr)  # speech segments
    return [(t["start"], t["end"]) for t in ts]              # (start, end) sample indices at rate sr
```
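The timestamps returned above are sample indices at the VAD rate (16 kHz here), so they need rescaling before you cut clips from a file stored at another rate, such as a 24 kHz TTS copy. The following is a minimal sketch of that step. `rescale_segments` is a hypothetical helper, not part of Silero VAD, and the padding and merge-gap defaults are illustrative assumptions rather than values from this article.

```python
# segments_util.py: hypothetical helper, not part of Silero VAD
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) in samples


def rescale_segments(segments: List[Segment], vad_sr: int, target_sr: int,
                     total_samples: int, pad_sec: float = 0.1,
                     merge_gap_sec: float = 0.3) -> List[Segment]:
    """Map VAD sample indices to target_sr, pad each side, merge close neighbours.

    pad_sec and merge_gap_sec are assumed defaults; tune them on your own data.
    """
    pad = int(pad_sec * target_sr)
    gap = int(merge_gap_sec * target_sr)
    out: List[Segment] = []
    for start, end in sorted(segments):
        # Convert from VAD-rate samples to target-rate samples, then pad.
        s = max(0, start * target_sr // vad_sr - pad)
        e = min(total_samples, end * target_sr // vad_sr + pad)
        if out and s - out[-1][1] <= gap:
            # Close enough to the previous segment: extend it instead.
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out
```

Merging can produce segments longer than the quality gate's maximum length, so the gate still needs to run after cutting.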

### ④ Quality gate (machine-reject bad ones)

This is **the key to protecting the dataset's quality.** Automatically exclude obviously bad clips.

```python
# quality_gate.py: mechanically filter clips
import numpy as np
import soundfile as sf

def passes_quality(path: str, *, min_sec=1.0, max_sec=15.0,
                   max_silence_ratio=0.5) -> bool:
    audio, sr = sf.read(path)
    dur = len(audio) / sr
    if not (min_sec <= dur <= max_sec):
        return False                                   # reject too short / too long
    # Silence ratio: fraction of samples with near-zero amplitude
    silence = float(np.mean(np.abs(audio) < 1e-3))
    if silence > max_silence_ratio:
        return False                                   # reject mostly-silent clips
    return True
```
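If you also want to know why a clip was rejected, the gate can return a reason label instead of a boolean. This sketch applies the same thresholds as `passes_quality` to precomputed values; `reject_reason`, `tally`, and the label strings are hypothetical names introduced for illustration.

```python
# gate_reasons.py: hypothetical variant of the gate that reports a reason
from collections import Counter
from typing import Iterable, Optional, Tuple


def reject_reason(duration_sec: float, silence_ratio: float, *,
                  min_sec: float = 1.0, max_sec: float = 15.0,
                  max_silence_ratio: float = 0.5) -> Optional[str]:
    """Return None if the clip passes, else a label for the first failed check."""
    if duration_sec < min_sec:
        return "too_short"
    if duration_sec > max_sec:
        return "too_long"
    if silence_ratio > max_silence_ratio:
        return "too_silent"
    return None


def tally(clips: Iterable[Tuple[float, float]]) -> Counter:
    """Count outcomes over (duration_sec, silence_ratio) pairs."""
    counts: Counter = Counter()
    for duration_sec, silence_ratio in clips:
        counts[reject_reason(duration_sec, silence_ratio) or "passed"] += 1
    return counts
```

Logging the resulting counter per run gives the pass count, exclusion count, and exclusion reasons in one place.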

On a **held-out set** that has ground truth (the original stems), you can score the separation quality itself with `SI-SDR` and reject clips below a threshold (the measurement method is covered in the [quality-evaluation guide](/blog/music-source-separation-quality-evaluation-sdr-museval)). For production material without ground truth, use the **energy ratio of music left on the accompaniment side** as a proxy indicator of "separation leakage."
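For reference, SI-SDR projects the estimate onto the reference and compares the energy of that projection with the energy of what remains. The function below is a minimal NumPy sketch of the standard definition, assuming two mono signals of equal length that are already time-aligned. Mean removal and the `eps` guard are common conventions added here, not details taken from this article.

```python
# si_sdr.py: minimal sketch of scale-invariant SDR (assumes aligned mono signals)
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB; higher means the estimate is closer to the reference."""
    est = np.asarray(estimate, dtype=np.float64)
    ref = np.asarray(reference, dtype=np.float64)
    if est.ndim != 1 or est.shape != ref.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    # Remove DC offset so a constant bias is not counted as signal.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: this part is the "target".
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    residual = est - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))
```

The rejection threshold is left to you: this article does not prescribe a value, so pick one by inspecting scores on your own held-out set.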

### ⑤⑥ Dedup & manifest generation

```python
# manifest.py: remove duplicates and build the mapping file the training framework reads
import hashlib, json
from pathlib import Path

def content_hash(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for b in iter(lambda: f.read(1 << 20), b""):
            h.update(b)
    return h.hexdigest()

def build_manifest(clips_dir: str, out_jsonl: str) -> int:
    seen: set[str] = set()
    n = 0
    with open(out_jsonl, "w", encoding="utf-8") as w:
        for clip in sorted(Path(clips_dir).glob("*.wav")):
            key = content_hash(clip)
            if key in seen:
                continue                               # skip exact duplicates
            seen.add(key)
            w.write(json.dumps({"audio": str(clip)}, ensure_ascii=False) + "\n")
            n += 1
    return n                                            # number of clips kept
```

For ASR, attach **the transcript text** to each entry here (written by hand, or generated with [Whisper](/blog/openai-whisper-production-guide-selfhost-vs-api) and then reviewed); for TTS, attach **the speaker ID and text.**

---

## When separation works and when it backfires (an honest take)

Source separation is not a cure-all. There are also **situations where it is better not to apply it.**

| Material | Should you separate? |
| --- | --- |
| YouTube/podcast (voice+BGM) | ✅ works. Extract the voice to clean it |
| Movie/drama (voice+SFX+music) | ✅ works (difficulty rises) |
| **Clean studio-recorded audio** | ❌ unnecessary. Separation artifacts may **degrade it instead** |
| Already voice-only narration | ❌ unnecessary |

> 💡 **Be especially careful with TTS**: separation can leave subtle artifacts (high-frequency bleed, etc.) that affect the **naturalness of the synthesized voice.** Don't separate material you can record cleanly; treat **separation as a means of "rescuing dirty material."**

---

## Cost, idempotency, observability

A data foundation that processes a large amount of audio is built with the same discipline as a production system.

- **Idempotency**: name each stage's output deterministically by the **content hash of the input** and skip finished items, so a re-run never processes the same input twice.
- **Cost**: separation is the heaviest stage and is GPU-hungry. **Separate only the material that needs it** (pass clean material through) and choose the model weight by use case.
- **Observability**: record each stage's **pass count, exclusion count, and exclusion reason**, so you can later trace why a given clip was dropped.
- **Scale**: beyond a few thousand items, move processing to a [GPU-worker foundation](/blog/music-source-separation-production-api-gpu-worker-queue) or [AWS batch](/blog/audio-source-separation-aws-gpu-batch-pipeline). For pitfalls such as the GPU not being used, see [troubleshooting](/blog/uvr5-audio-separator-troubleshooting-gpu-cuda-oom).
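The idempotency and observability points above can be sketched together as a small stage runner. This is one possible shape, not the author's implementation. `content_key` and `run_stage` are hypothetical names, and including the stage name and parameters in the hash is an assumption, made so that changing a setting such as the sample rate produces a new output instead of a stale skip.

```python
# stage_runner.py: hypothetical sketch of hash-named, skippable stage outputs
import hashlib
from collections import Counter
from pathlib import Path
from typing import Callable


def content_key(path: Path, stage: str, params: str = "") -> str:
    """Deterministic key from the input bytes plus the stage name and parameters."""
    h = hashlib.sha256()
    h.update(f"{stage}|{params}|".encode("utf-8"))
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:32]


def run_stage(src: Path, out_dir: Path, stage: str,
              process: Callable[[Path, Path], None],
              stats: Counter, params: str = "", suffix: str = ".wav") -> Path:
    """Run process(src, tmp) unless the output for this exact input already exists."""
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / f"{content_key(src, stage, params)}{suffix}"
    if dst.exists():
        stats[f"{stage}:skipped"] += 1            # finished on an earlier run
        return dst
    # Write to a temp name that keeps the extension, then rename into place,
    # so a crash never leaves a half-written file under the final name.
    tmp = dst.with_name(f".tmp-{dst.name}")
    try:
        process(src, tmp)
        tmp.replace(dst)
        stats[f"{stage}:done"] += 1
    except Exception as exc:
        tmp.unlink(missing_ok=True)
        stats[f"{stage}:failed:{type(exc).__name__}"] += 1
        raise
    return dst
```

A resample stage would then be called as `run_stage(src, Path("stage2"), "resample", lambda s, d: to_target(str(s), str(d), 16000), stats, params="sr=16000")`, and `stats` can be written to the run log when the job ends.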

---

## ⚠️ Data governance: consent and license are most important

Training data is **a governance problem before it is a technical one.** Get this wrong and the model you built becomes unusable.

- **Source license**: confirm whether the **training use is permitted** by the collection source (commercial songs, distributed audio, etc.). Training on copyrighted works without permission is a legal risk.
- **Speaker consent (especially TTS/voice-cloning)**: if you train TTS on a real speaker's voice, **the person's explicit consent** is required. A voice is information tied directly to personhood, so build consent, purpose, and a means of withdrawal into the design (consent governance is discussed in the [voice-cloning consent-design article](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)).
- **PII**: stipulate the handling, storage, and masking of personal information contained in transcripts or audio.
- **Provenance**: leave each clip's **origin, license, and consent status** in the manifest to make it auditable.
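
One way to make provenance auditable is to carry it in every manifest row and filter on it before training. The sketch below is illustrative only. The field names, the consent states, and the rule in `usable_for` are assumptions made for this example, and the real policy has to come from your own license terms and legal review.

```python
# provenance.py: hypothetical manifest record with provenance fields
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ClipRecord:
    audio: str                  # path to the clip
    text: Optional[str]         # transcript, if available
    speaker_id: Optional[str]   # speaker label (needed for TTS)
    source: str                 # where the material came from
    license: str                # license or terms covering training use
    consent: str                # assumed states: granted / not_required / unknown / withdrawn


def usable_for(record: ClipRecord, purpose: str) -> bool:
    """Conservative example rule: unknown license never passes; TTS needs explicit consent."""
    if not record.license or record.license == "unknown":
        return False
    if purpose == "tts":
        return record.consent == "granted" and record.speaker_id is not None
    return record.consent in ("granted", "not_required")


def to_jsonl_line(record: ClipRecord) -> str:
    """Serialize one record as a JSONL manifest row."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

Because withdrawal is a state of the record rather than a deleted row, a later audit can still show which clips were removed from training and why.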

---

## Frequently asked questions (FAQ)

**Q. What's the difference from transcription preprocessing at inference time?**
A. The purpose differs. The [Whisper accuracy-improvement article](/blog/source-separation-asr-preprocessing-whisper-accuracy) is about "**accurately transcribing the audio you have now.**" This article is about "**building a clean dataset to train a model.**" Even where the pipelines look similar, the output side (manifest, quality gate, governance) differs.

**Q. Which model should I separate with?**
A. **For clean-priority, [RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)**; **for volume/speed-priority, [MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide).** Decide by dataset scale and GPU budget.

**Q. What should the sample rate be?**
A. **Match the model being trained.** ASR (Whisper family) is commonly **16kHz mono**, TTS is **22.05/24kHz.** The safe flow is to do the intermediate separation at 44.1kHz and resample to the target SR at the end.

**Q. Should I apply separation to clean recordings too?**
A. **No.** Separation can leave artifacts, so it is unnecessary for clean material and can even lower quality. The standard practice is to limit separation to "rescuing dirty material."

**Q. What does the quality gate judge by?**
A. Reject the obviously bad by **machine indicators** like length, silence ratio, and residual music energy, and on a held-out set with ground truth, score the separation quality with **SI-SDR** ([quality-evaluation guide](/blog/music-source-separation-quality-evaluation-sdr-museval)).

---

## Conclusion: before the model, prepare the data

The quality of TTS/ASR is **mostly decided by data preparation before training.**

1. **Extract the voice to clean it** (dirty material only; pass clean recordings through).
2. **Resample to the target SR → VAD splitting → quality gate → dedup → manifest.**
3. Make it **idempotent, cost-optimal, and observable**, and run at scale on a GPU-batch foundation.
4. Weave **consent, license, PII, and provenance** into the data foundation from the start.

> The **data foundation that comes before "training a model" is the dividing line for quality and legal safety.** If you want to build a clean speech-data foundation for TTS/ASR at production quality, feel free to get in touch, and see the [case study](/case-studies/ai-video-localization-lipsync). With **one person × generative AI**, I support the whole path from data collection through training and operation.

---

## Sources / related resources

- **Separation library**: [nomadkaraoke/python-audio-separator (MIT)](https://github.com/nomadkaraoke/python-audio-separator)
- **VAD**: [snakers4/silero-vad](https://github.com/snakers4/silero-vad)
- **Audio I/O**: ffmpeg (resample/format conversion)
- **Quality evaluation**: this blog, [measure source-separation quality numerically](/blog/music-source-separation-quality-evaluation-sdr-museval)
- **Model selection**: this blog, [how to choose a source-separation tool](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)

* Library APIs and recommended sample rates change over time, so confirm primary sources before implementing. Whether training data may be used, and how speaker consent and PII must be handled, depend on jurisdiction and service terms. Always confirm primary sources and obtain legal review before collecting and using data.
