友田 陽大

Building TTS/ASR training data with source separation: a preprocessing pipeline for clean speech datasets

This article explains how to mass-produce clean TTS/ASR training data with source separation (UVR5/Demucs). Using working code, it walks through the pipeline of BGM/noise removal → resample → VAD splitting → quality gate → manifest generation. It also covers when separation helps and when it backfires, quality judgment by residual energy, idempotency and cost, and consent and license governance for audio data, all from the viewpoint of production data-foundation design.

Reading time: 10 min read
Author: 友田 陽大

The goal of this article

When training or fine-tuning a TTS (speech synthesis) or ASR (speech recognition) model, the biggest factor in output quality is neither the model nor the hyperparameters but the quality of the training data. If you train on audio mixed with BGM, laughter, or environmental sound, the model's output inherits those contaminants.

What helps here is preprocessing by source separation: extract only the voice from songs or videos and mass-produce a clean speech corpus. This article presents that dataset-building pipeline with its design and working code. By the end, you should be able to do the following.

  1. Design a preprocessing foundation of collected audio → separation → resample → VAD splitting → quality gate → manifest.
  2. Judge when separation works and when it backfires, and avoid wasted processing.
  3. Weave the consent/license of audio data into the data-foundation design from the start.

📐 This article's position: preprocessing to "transcribe the audio at hand more accurately" is covered in the Whisper accuracy-improvement article. This article has a different purpose: building a clean dataset to train a model. Inference preprocessing and data building are similar but not the same, and this article handles the data-foundation side.

About the author: I have single-handedly designed and implemented, and now operate in production, an AI audio/video foundation with audio separation as its first stage. Training-data preprocessing is the "plain but most effective" process for lifting model performance, and this article shares that implementation know-how. For choosing a separation model, see the selection guide.


30-second summary (conclusion first)

Point | Conclusion
Why separate | Training on audio with BGM/noise degrades quality. Extract only the voice to clean it
Pipeline | separate (voice) → resample → VAD splitting → quality gate → dedup → manifest
Resample | ASR=16kHz mono, TTS=22.05/24kHz is common (match the target model)
Quality gate | auto-select by residual music energy, silence ratio, length, and (if ground truth exists) SI-SDR
When to use separation | Effective for material with BGM/noise. Unnecessary for clean studio recordings (artifact-injection risk)
Model | For clean-priority, a high-quality model (RoFormer); for volume-priority, MDX-Net
Governance | Consent, license, and PII matter most. TTS/voice-cloning in particular requires speaker consent

Why the quality of training data decides everything

The iron rule of machine learning, "garbage in, garbage out," is especially pronounced with audio.

  • TTS: if the training audio is mixed with BGM, the synthesized voice also picks up background bleed and an unnatural texture. You want to train cleanly on just the speaker's voice.
  • ASR: training on data mixed with noise/music raises robustness to noise, but can affect accuracy on clean audio and the stability of transcription. Mix clean and noisy data deliberately according to the use case.

When building a speech corpus from "voice + BGM" material like YouTube, podcasts, or movies, source separation works as preprocessing that pulls out only the voice. Conversely, for recordings that were clean studio takes from the start, separation is unnecessary, and applying it can actually lower quality through separation artifacts (discussed later).


The whole pipeline

collected audio (voice+BGM)
   │
   ├─① source separation → extract only the vocal (voice)   [UVR5 / RoFormer / Demucs]
   │
   ├─② resample / make mono                                 [ASR=16k / TTS=22.05k or 24k]
   │
   ├─③ drop silence/non-speech with VAD, split into utterances  [Silero VAD, etc.]
   │
   ├─④ quality gate (machine-reject bad ones)               [residual music / silence ratio / length / SI-SDR]
   │
   ├─⑤ deduplication / near-dup removal                     [hash / acoustic fingerprint]
   │
   └─⑥ manifest generation (audio↔meta mapping)             [JSONL / CSV]

Build each stage to be idempotent and reproducible (same output for the same input), so that even if the run crashes midway, finished stages are skipped and processing can resume.
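
As a minimal sketch of the skip-and-resume idea above, a stage can name its output after the content hash of its input and do nothing when that output already exists. The helper names (`file_sha256`, `run_stage`) and the `.partial-` temp-file prefix are assumptions for illustration, not part of any library.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def file_sha256(path: Path) -> str:
    """Hash file contents in 1 MiB chunks so large audio files stay out of memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def run_stage(src: Path, out_dir: Path, suffix: str,
              process: Callable[[Path, Path], None]) -> tuple[Path, bool]:
    """Run process(src, dst) unless the output for this input already exists.

    Returns (dst, ran); ran is False when a finished output was reused.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / f"{file_sha256(src)}{suffix}"
    if dst.exists():
        return dst, False                          # finished earlier: skip
    tmp = dst.with_name(f".partial-{dst.name}")    # keeps the extension for tools like ffmpeg
    process(src, tmp)                              # write under a temporary name first
    os.replace(tmp, dst)                           # rename marks the stage as complete
    return dst, True
```

Because the final name only appears after the rename, a crash in the middle of `process` leaves a `.partial-` file behind rather than a truncated output that a later run would mistake for finished work.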


Implementation: separate → resample → VAD → quality gate → manifest

① Extract the voice (batch)

# extract_vocals.py: extract only the voice from collected audio (load the model once and reuse it)
from pathlib import Path
from audio_separator.separator import Separator

sep = Separator(output_dir="stage1_vocals", output_format="flac",
                output_single_stem="Vocals")          # write out only the vocal stem
sep.load_model(model_filename="Kim_Vocal_2.onnx")     # MDX-family models are fast for bulk runs

def extract(src: Path) -> str:
    return sep.separate(str(src))[0]                   # path of the extracted vocal file

If cleanness is the top priority, switch the model to BS-RoFormer (heavy but high quality). For high-volume processing, favor speed with the MDX family. Decide the quality-vs-volume trade-off by use case.

② Resample & make mono

Match the sample rate to the model being trained. Commonly, ASR is 16kHz mono, TTS is 22.05kHz or 24kHz.

# normalize.py: resample and downmix to mono for the target use
import subprocess

def to_target(src: str, dst: str, sr: int) -> None:
    # Use ffmpeg for reliability. -ac 1 = mono, -ar = sample rate
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sr), dst],
                   check=True)

# For ASR: to_target(v, "asr/clip.wav", 16000)
# For TTS: to_target(v, "tts/clip.wav", 24000)
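
To confirm that a normalized file really matches the target format before it moves to the next stage, one option is a small check with the standard-library `wave` module. This is a sketch that assumes the output is PCM WAV; the function name `check_wav_format` is hypothetical.

```python
import wave


def check_wav_format(path: str, expected_sr: int, expected_channels: int = 1) -> bool:
    """Return True if the PCM WAV at `path` has the expected rate and channel count."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == expected_sr
                and w.getnchannels() == expected_channels)
```

For example, `check_wav_format("asr/clip.wav", 16000)` could run right after `to_target` so that a mismatched file is caught before VAD splitting.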

③ Split into utterances with VAD

Cut long audio into utterance (speech-segment) units and drop silence/non-speech. This avoids mixing "silence-filled clips" into training.

# segment.py: split into speech segments with VAD (example: Silero VAD)
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils     # utils is a tuple of helper functions

def speech_segments(wav_path: str, sr: int = 16000):
    wav = read_audio(wav_path, sampling_rate=sr)
    ts = get_speech_timestamps(wav, model, sampling_rate=sr)  # speech segments
    return [(t["start"], t["end"]) for t in ts]              # list of (start, end) sample positions
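
The timestamps returned above are sample positions, so they still have to be turned into clip boundaries. One possible approach, sketched below, merges segments separated by a short pause and stops growing a clip once it would exceed a maximum length. The function name and the default thresholds (`max_gap_sec=0.3`, `max_len_sec=15.0`) are assumptions to tune per dataset.

```python
def merge_segments(segments: list[tuple[int, int]], sr: int,
                   max_gap_sec: float = 0.3,
                   max_len_sec: float = 15.0) -> list[tuple[int, int]]:
    """Merge nearby (start, end) sample ranges into utterance-sized clips."""
    max_gap = int(max_gap_sec * sr)
    max_len = int(max_len_sec * sr)
    merged: list[tuple[int, int]] = []
    for start, end in sorted(segments):
        if merged:
            prev_start, prev_end = merged[-1]
            close_enough = start - prev_end <= max_gap   # pause is short
            fits = end - prev_start <= max_len           # merged clip stays within the limit
            if close_enough and fits:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

A single segment that is already longer than `max_len_sec` is passed through unchanged here; the length check in the quality gate is what would reject it.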

④ Quality gate (machine-reject bad ones)

This is the key to protecting the dataset's quality. Automatically exclude obviously bad clips.

# quality_gate.py: mechanically screen clips
import numpy as np
import soundfile as sf

def passes_quality(path: str, *, min_sec=1.0, max_sec=15.0,
                   max_silence_ratio=0.5) -> bool:
    audio, sr = sf.read(path)
    dur = len(audio) / sr
    if not (min_sec <= dur <= max_sec):
        return False                                   # reject too short / too long
    # Silence ratio: fraction of samples with very small amplitude
    silence = float(np.mean(np.abs(audio) < 1e-3))
    if silence > max_silence_ratio:
        return False                                   # reject mostly silent clips
    return True

On a held-out set that has the ground truth (original stems), you can score the separation quality itself with SI-SDR and reject below a threshold (how to measure is in the quality-evaluation guide). For production material without ground truth, use the energy ratio of music left on the accompaniment side as a proxy indicator of "separation leakage."
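
The two indicators mentioned above can be sketched with NumPy as follows. `si_sdr` follows the usual scale-invariant SDR definition for 1-D mono signals of equal length. `accompaniment_energy_ratio` is only one possible reading of the "energy ratio" proxy: it assumes you also kept the separator's accompaniment stem for the same time range (the extraction script above writes only the vocal stem), and the function name and any rejection threshold are assumptions.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between a separated signal and its ground truth."""
    est = estimate - estimate.mean()
    ref = reference - reference.mean()
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref                  # part of the estimate explained by the reference
    noise = est - target                  # everything else counts as distortion
    return float(10 * np.log10((np.dot(target, target) + eps)
                               / (np.dot(noise, noise) + eps)))


def accompaniment_energy_ratio(vocals: np.ndarray, accompaniment: np.ndarray,
                               eps: float = 1e-12) -> float:
    """Share of total energy that the separator assigned to the accompaniment stem."""
    e_voc = float(np.sum(vocals.astype(np.float64) ** 2))
    e_acc = float(np.sum(accompaniment.astype(np.float64) ** 2))
    return e_acc / (e_voc + e_acc + eps)
```

A clip whose accompaniment share is high came from a passage with a lot of music, so leakage into the vocal stem is more likely there; where to put the cut-off is best decided by listening to samples on both sides of a candidate threshold.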

⑤⑥ Dedup & manifest generation

# manifest.py: remove duplicates and build the mapping file the training framework reads
import hashlib, json
from pathlib import Path

def content_hash(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for b in iter(lambda: f.read(1 << 20), b""):
            h.update(b)
    return h.hexdigest()

def build_manifest(clips_dir: str, out_jsonl: str) -> int:
    seen: set[str] = set()
    n = 0
    with open(out_jsonl, "w", encoding="utf-8") as w:
        for clip in sorted(Path(clips_dir).glob("*.wav")):
            key = content_hash(clip)
            if key in seen:
                continue                               # skip exact duplicates
            seen.add(key)
            w.write(json.dumps({"audio": str(clip)}, ensure_ascii=False) + "\n")
            n += 1
    return n                                            # number of accepted clips

For ASR, attach the transcript text here (written by hand, or generated with Whisper and then reviewed); for TTS, attach the speaker ID and text.
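
A manifest row that carries those extra fields might look like the sketch below. Only the `audio` key comes from the script above; `text` and `speaker_id` are hypothetical field names, and the names your training framework expects may differ.

```python
import json
from typing import Optional


def manifest_line(audio: str, text: Optional[str] = None,
                  speaker_id: Optional[str] = None) -> str:
    """Serialize one manifest row as a JSONL line, omitting fields that are unset."""
    row = {"audio": audio, "text": text, "speaker_id": speaker_id}
    row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row, ensure_ascii=False)
```

Keeping `ensure_ascii=False`, as in `build_manifest`, leaves Japanese transcripts readable in the file instead of turning them into escape sequences.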


When separation works and when it backfires (the honest talk)

Source separation is not omnipotent. There are also situations where it's better not to apply it.

Material | Should you separate?
YouTube/podcast (voice+BGM) | ✅ Works. Extract the voice to clean it
Movie/drama (voice+SFX+music) | ✅ Works (difficulty rises)
Clean studio-recorded audio | ❌ Unnecessary. Separation artifacts may degrade it instead
Already voice-only narration | ❌ Unnecessary

💡 Be especially careful with TTS: separation can leave subtle artifacts (high-frequency bleed, etc.) that affect the naturalness of the synthesized voice. Don't separate material you can record cleanly; separation is best positioned as a means to "rescue dirty material."
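
If each item is labeled with its material type at collection time, the table above can be applied as a simple lookup before the separation stage. The labels and the function name below are hypothetical; the point is only that clean material skips separation by rule rather than by case-by-case judgment.

```python
# Hypothetical source labels, assigned when the material is collected.
NEEDS_SEPARATION = {
    "youtube": True,        # voice + BGM
    "podcast": True,        # voice + BGM
    "movie": True,          # voice + SFX + music (harder)
    "studio": False,        # clean recording: separation may add artifacts
    "narration": False,     # already voice-only
}


def should_separate(source_kind: str) -> bool:
    """Decide whether a clip goes through separation; unknown kinds raise KeyError."""
    return NEEDS_SEPARATION[source_kind]
```

Raising on an unknown label is deliberate in this sketch: an unlabeled item is stopped for review instead of being silently separated or silently passed through.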


Cost, idempotency, observability

A data foundation that processes a large amount of audio is built with the same discipline as a production system.

  • Idempotency: name each stage's output deterministically by the content hash of the input and skip finished ones, so a re-run never double-processes.
  • Cost: separation is the heaviest stage and consumes GPU time. Separate only the material that needs it (pass clean material through) and choose model weight by use case.
  • Observability: record each stage's pass count, exclusion count, and exclusion reasons, so that you can later trace "why this clip was dropped."
  • Scale: once you exceed a few thousand items, move the work onto a GPU-worker foundation or AWS Batch. For pitfalls such as the GPU not being used, see the troubleshooting article.
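
For the observability point, a per-stage tally can be as small as the sketch below. The class name `StageStats` and the reason strings are assumptions; in practice the summary would be written to a log or metrics store at the end of each stage.

```python
from collections import Counter
from typing import Optional


class StageStats:
    """Tally pass/exclusion counts and reasons for one pipeline stage."""

    def __init__(self, stage: str) -> None:
        self.stage = stage
        self.passed = 0
        self.reasons = Counter()          # exclusion reason -> count
        self.dropped = {}                 # clip id -> reason, for later tracing

    def record(self, clip_id: str, reason: Optional[str] = None) -> None:
        """Record a pass (reason is None) or an exclusion with its reason."""
        if reason is None:
            self.passed += 1
        else:
            self.reasons[reason] += 1
            self.dropped[clip_id] = reason

    def summary(self) -> dict:
        """Return counts suitable for logging as one JSON object."""
        return {"stage": self.stage, "passed": self.passed,
                "excluded": sum(self.reasons.values()),
                "reasons": dict(self.reasons)}
```

Looking up `stats.dropped["clip_0042"]` (a made-up clip id) then answers the "why was this clip dropped" question directly.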


Governance: consent, license, PII

Training data is a governance problem before it is a technical one. Miss this and the very model you built becomes unusable.

  • Source license: confirm whether the training use is permitted by the collection source (commercial songs, distributed audio, etc.). Training on copyrighted works without permission is a legal risk.
  • Speaker consent (especially TTS/voice-cloning): if you train TTS on a real speaker's voice, the person's explicit consent is required. The voice is information directly tied to personhood — build consent, purpose, and a means of withdrawal into the design (the way of thinking about consent governance is in the voice-cloning consent-design article).
  • PII: stipulate the handling, storage, and masking of personal information contained in transcripts or audio.
  • Provenance: leave each clip's origin, license, and consent status in the manifest to make it auditable.
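
One way to make provenance auditable is to carry it as structured fields on every manifest row and to gate TTS data on them. The sketch below assumes a hypothetical `Provenance` record; the field names are illustrative, and what counts as a valid license or consent is decided outside the code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    source: str             # where the clip came from (URL, archive id, ...)
    license: str            # license or contract under which training is allowed
    speaker_consent: bool   # explicit consent from the speaker is on record


def usable_for_tts(p: Provenance) -> bool:
    """Admit a clip to a TTS set only with a recorded license and speaker consent."""
    return bool(p.license.strip()) and p.speaker_consent
```

`asdict(p)` yields a plain dict that can be merged into the JSONL row, so the origin, license, and consent status travel with the audio path.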

Frequently asked questions (FAQ)

Q. What's the difference from transcription preprocessing at inference time? A. The purpose differs. The Whisper accuracy-improvement article is about "accurately transcribing the audio you have now." This article is about "building a clean dataset to train a model." Even if the pipeline is similar, the exit (manifest, quality gate, governance) differs.

Q. Which model should I separate with? A. For clean-priority, RoFormer; for volume/speed-priority, MDX-Net. Decide by dataset scale and GPU budget.

Q. What should the sample rate be? A. Match the model being trained. ASR (Whisper family) is commonly 16kHz mono, TTS is 22.05/24kHz. The safe flow is to do the intermediate separation at 44.1kHz and resample to the target SR at the end.

Q. Should I apply separation to clean recordings too? A. No. Separation can leave artifacts, so it's unnecessary for clean material — it can lower quality instead. The standard is to limit separation to the use of "rescuing dirty material."

Q. What does the quality gate judge by? A. Reject the obviously bad by machine indicators like length, silence ratio, and residual music energy, and on a held-out set with ground truth, score the separation quality with SI-SDR (quality-evaluation guide).


Conclusion: before the model, prepare the data

The quality of TTS/ASR is mostly decided by data preparation before training.

  1. Extract the voice to clean it (dirty material only. Pass clean recordings through).
  2. Resample to the target SR → VAD splitting → quality gate → dedup → manifest.
  3. Build it idempotent, cost-optimal, observable, and put scale on a GPU-batch foundation.
  4. Weave consent, license, PII, and provenance into the data foundation from the start.

The data foundation that precedes model training is what determines both quality and legal safety. If you want to build a clean speech-data foundation for TTS/ASR at production quality, feel free to get in touch, and see the case study as well. As one person working with generative AI, I support the work end to end, from data collection through training and operation.


  • Library APIs and recommended sample rates change over time, so confirm primary sources before implementing. Whether training data may be used, and how speaker consent and PII must be handled, depend on jurisdiction and service terms. Always confirm primary sources and obtain legal review before collecting and using data.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
