Category
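Implementation guide to source separation and audio preprocessing (Demucs / UVR5 / vocal extraction / ASR preprocessing)
Source separation is the technique of decomposing a single audio track into its components, such as vocals, drums, bass, and accompaniment. Its applications are broad: karaoke generation, multilingual dubbing of videos that keeps the BGM, better transcription accuracy for noisy audio, remixing, and transcribing music by ear. Centered on Demucs v4, which is SOTA-class among publicly available models, and UVR5 (MDX-Net), which specializes in vocal separation, this cluster covers tool selection from requirements, ASR preprocessing pipelines, quality evaluation with SDR/museval, and a production architecture of GPU workers × job queue × idempotency. With type safety, resilience, observability, and cost as its axes, it covers the design that makes source separation pay off in production.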
12 articles in total
Foundational guide (start here)
How to choose a source-separation tool: selecting Demucs / UVR5 (MDX-Net) / Spleeter / Open-Unmix by requirements
A side-by-side comparison of the major open-source music source-separation tools (Demucs v4, UVR5 (MDX-Net), Spleeter, Open-Unmix) across quality, speed, license, setup difficulty, and memory use. With real code, it presents a decision framework that works backward from requirements to the right tool for each project, and the license pitfalls you must check before commercial use.
Related practical articles
- Source separation, AWS, GPU, MLOps, SQS
Scaling audio source separation in production on AWS: a GPU batch-processing platform (SQS × ECS/Batch × S3)
Takes UVR5/MDX-Net and Demucs source separation from manual, one-file-at-a-time runs to production scale. It designs an idempotent, queue-driven pipeline of S3 event → SQS → GPU worker (AWS Batch / ECS) → S3, with concrete boto3 and Terraform code covering visibility-timeout heartbeats, graceful termination on Spot interruption, a DLQ, structured logs, and S3-key idempotency.
14 min read
- BS-RoFormer, Mel-Band RoFormer, source separation, vocal extraction, AI audio
Complete guide to BS-RoFormer / Mel-Band RoFormer: using 2026's highest-quality source separation in production
Explains, faithfully to the official papers, the current SOTA in source separation: BS-RoFormer (Band-Split RoPE Transformer) and Mel-Band RoFormer. It covers what production use requires: why the quality is the highest (band-split × RoPE Transformer), the SDX23 1st-place and MUSDB18HQ 9.80 dB track record, execution code with audio-separator, realistic VRAM and speed expectations with OOM countermeasures, and how to choose between it and MDX-Net.
9 min read
- Demucs, source separation, audio processing, Python, GPU
Demucs v4 Complete Guide: Running Meta's Source-Separation Model (HT Demucs) in Production, Faithful to the Official Docs
Explains Meta's source-separation model Demucs v4 (HT Demucs), faithfully to the official documentation (GitHub, paper). It covers the waveform × spectrogram × Transformer mechanism, how to choose among the htdemucs-family models, CLI and Python API implementation, and practical recipes for vocal separation, karaoke, ASR preprocessing, and video localization. The production-operations design, including long-audio OOM handling, idempotency, and resilience, is shown with concrete code.
25 min read
- Source separation, MLOps, architecture design, Python, GPU
Turning source separation into a production API: the design of GPU worker × job queue × idempotency
Takes source separation with models like Demucs from a demo to a production service. Using real code for a type-safe FastAPI ingress and a Python worker, it explains an architecture that puts heavy GPU processing on an asynchronous job queue and provides idempotency, resilience, observability, and cost efficiency. It covers the design needed for production operation: OOM recovery, graceful shutdown, at-least-once delivery, GPU auto-scaling, and testability.
14 min read
- Source separation, quality evaluation, SDR, testing, Python
Measuring source-separation quality in numbers: SDR / museval and a CI quality gate
Explains how to evaluate source-separation quality with numbers rather than by ear. With real code, it shows a design for ensuring production quality: what BSSEval v4's SDR/ISR/SIR/SAR measure, implementation with museval (the official evaluation tool), comparison on your own material, a CI quality gate that stops regressions when models or parameters change, and alternative metrics for real-world material that has no reference.
10 min read
- Source separation, real-time, low latency, streaming, audio processing
Is real-time source separation possible: the design and limits of low latency (the reality of streaming processing)
For readers who want to run source separation (vocal/accompaniment separation) in real time at low latency, this article gives an honest assessment of feasibility based on a latency breakdown and the characteristics of each model. It covers why MDX-Net, Demucs, and RoFormer are inherently batch-oriented, the design and quality trade-offs of approximating real time with chunked/streaming processing, how this differs from noise suppression, and how to judge whether you really need real time.
9 min read
- Source separation, Whisper, transcription, audio processing, Python
Raising Whisper transcription accuracy with source separation: designing an audio-preprocessing pipeline
Explains how to raise transcription accuracy for audio with BGM or noise by preprocessing it with source separation (Demucs / UVR5). It shows the pipeline of vocal extraction → 16 kHz normalization → VAD → Whisper in real code, and covers the production-operation design: when separation helps and when it backfires, WER measurement with jiwer, idempotency, cost, and observability.
10 min read
- Source separation, TTS, ASR, datasets, audio preprocessing
Building TTS/ASR training data with source separation: a preprocessing pipeline for clean speech datasets
Explains how to mass-produce TTS/ASR training data by cleaning it with source separation (UVR5/Demucs). With real code, it shows the pipeline of BGM/noise removal → resampling → VAD splitting → quality gate → manifest generation. As a comprehensive treatment of production data-platform design, it also covers when separation helps and when it backfires, quality judgment by residual energy, idempotency and cost, and consent and license governance for audio data.
10 min read
- UVR5, audio-separator, GPU, CUDA, ONNX
Complete UVR5 / audio-separator troubleshooting guide (GPU not used, CUDA, OOM, installation)
'GPU isn't used and it's painfully slow,' 'CUDA out of memory,' 'cuDNN errors,' 'ffmpeg is missing,' 'the model is downloaded every time.' Organized by symptom and grounded in official ONNX Runtime/PyTorch information, this guide resolves the problems people commonly get stuck on in source separation with UVR5 and audio-separator, from diagnostic commands to concrete fix procedures.
11 min read
- UVR5, karaoke, vocal extraction, source separation, a cappella
Complete guide to making karaoke tracks and a cappella with UVR5: instrumental extraction / vocal extraction / harmony removal
A practical guide to making karaoke tracks (instrumental) and a cappella (vocals), and to removing harmonies, from songs with UVR5 (MDX-Net). Written so that even a first-timer can produce results today, it explains recommended models per use (Inst-type/Vocal-type/KARA_2), both GUI and code procedures, tips for better audio quality, batch processing of an album, and the easy-to-miss copyright cautions.
7 min read
- UVR5, MDX-Net, source separation, vocal extraction, Python
UVR5 (MDX-Net) Complete Guide: Separating Vocals/Accompaniment with High Accuracy and Automating It in Production, Faithful to Official Sources
Explains the open-source source-separation tool UVR5 and the MDX-Net architecture, faithfully to official information (GitHub, arXiv papers). It shows a production-operation implementation in concrete code, from trying it in the GUI and automating it with python-audio-separator, through model selection (Inst/Vocal/Karaoke) and tuning of parameters such as segment_size, to OOM handling, CPU fallback, idempotency, and observability.
27 min read