The goal of this article
Run source separation with Demucs or UVR5 and the next question is always the same: "So how good is the quality?" If all you can say is "it sounds good to my ear," you will get stuck both in production and in a project proposal.
This piece shows, with real code, how to measure source-separation quality as objective numbers. By the end, you should be able to:
- Explain to others what the 4 metrics SDR/ISR/SIR/SAR measure.
- Quantify quality with museval and compare tools/parameters on the same footing.
- Use a CI quality gate to automatically block a model or parameter change that lowers quality.
About the author (reliability disclosure): I single-handedly designed, implemented, and run in production an AI video-localization platform where uploading a video is all it takes to get fully automated multilingual localization. Source separation is its first stage. Back when I checked quality by ear alone, broken clips slipped into production and dragged down the downstream transcription and dubbing. The evaluation design described here is the record of how I rebuilt that stage to "watch quality with numbers" as a result.
30-second summary
| Aspect | Conclusion |
|---|---|
| Why quantify | By ear alone, you miss breakage in large batches. The good/bad of model/parameter changes can't be judged subjectively |
| Metrics to use | BSSEval v4: SDR (overall) / SIR (contamination, residue) / SAR (artifacts) / ISR (spatial distortion) |
| Tool | museval (the official evaluation tool). pip install museval. museval.evaluate(references, estimates) |
| Aggregation | Compute per frame → frame median → track median (robust to outliers) |
| Premise | Correct stems (reference) needed. Measure on a fixed eval set (MUSDB18 or your own labels) |
| CI quality gate | On a change, measure the eval set's SDR median, and fail the test if it falls below a threshold |
| Real material (no reference) | Reference-free alternative metrics + human sampling of low-score tracks |
| Caution | SDR is a relative value depending on data/version/evaluation method. Don't compare directly across papers |
The meaning of each metric is covered in "What the 4 metrics measure" below.
Why "the ear" alone isn't enough
Leave source-separation quality checking to the human ear and it always breaks down in three situations.
- Scale: in a batch processing thousands a day, it's impossible for a person to listen to every track. Broken clips always slip through.
- Regression: you updated the model, changed shifts, or switched tools. You can't judge subjectively whether it got better than before.
- Comparison: which of Demucs and UVR5 is better on your material? Without the same yardstick, the discussion becomes an unwinnable argument.
The solution is to make quality a reproducible number. That's BSSEval v4 and museval.
What the 4 metrics measure
The standard metrics for source separation are the four BSSEval v4 metrics, which museval computes. All are in dB, and higher is better. Intuitively:
| Metric | Full name | What it measures (intuition) |
|---|---|---|
| SDR | Source-to-Distortion Ratio | Overall quality. The strength of the target sound against all distortion. "Rough good/bad" is this |
| SIR | Source-to-Interference Ratio | Contamination from other sources, e.g. drums leaking into the vocals. Higher means less residue |
| SAR | Source-to-Artifacts Ratio | Artifacts. Higher means less of the unnatural noise or metallic sound introduced by the separation itself |
| ISR | Image-to-Spatial-distortion Ratio | Spatial (localization) distortion. Whether the stereo left/right image is broken |
How to use each one:
- For an overall good/bad judgment, use SDR as the main metric.
- If the problem is "vocals remain in the instrumental," look at SIR (the contamination metric).
- If the problem is "strange noise rides on the voice," look at SAR (the artifact metric).
In other words, SDR gives the overall picture, and SIR/SAR isolate the symptom. This breaks "somehow bad" down into "too much contamination" or "too many artifacts."
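To make the dB scale concrete, here is a minimal sketch of the plain energy-ratio form of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²). It is a simplification and not what museval computes: BSSEval v4 also splits the error into spatial, interference, and artifact parts before taking ratios. The name `simple_sdr_db` and the test signal are made up for this illustration.

```python
import numpy as np

def simple_sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Plain energy-ratio SDR in dB (a simplification, NOT BSSEval v4)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite when the reference or the error is exactly zero
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))

# Synthetic example: a 440 Hz tone plus an error whose amplitude is 10% of the tone
t = np.arange(8000) / 8000.0
reference = np.sin(2 * np.pi * 440.0 * t)
estimate = reference + 0.1 * np.cos(2 * np.pi * 440.0 * t)

print(round(simple_sdr_db(reference, estimate), 2))  # 20.0
```

An error at 10% of the signal amplitude is 1% of its energy, which is 20 dB. Each additional 10 dB means the error energy shrank by another factor of ten.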
Measure with museval (minimal implementation)
museval is the official evaluation tool of SiSEC / the Music Demixing Challenge, used in combination with the MUSDB18 dataset. The core museval.evaluate works with just numpy arrays (musdb-independent).
pip install museval
import museval
import numpy as np
# references and estimates both have shape (n_sources, n_samples, n_channels)
# references = correct stems (ground truth), estimates = the model's separated output
SDR, ISR, SIR, SAR = museval.evaluate(references, estimates)
# Each return value is an (n_sources, n_frames) array. Aggregate a track's score as the frame median
sdr_per_source = np.nanmedian(SDR, axis=1)  # median that ignores nan frames
print("vocals SDR:", sdr_per_source[0], "dB")  # assumes source index 0 is vocals
If using MUSDB18, the track-level high-level API is convenient.
import musdb
import museval

mus = musdb.DB(root="/path/to/musdb18", subsets="test")
results = museval.EvalStore()  # store that aggregates results across tracks
for track in mus:
    # run_my_separation is your own function; it returns {"vocals": ndarray, "accompaniment": ndarray}
    estimates = run_my_separation(track.audio)
    scores = museval.eval_mus_track(track, estimates)  # evaluate one track
    results.add_track(scores)
# Aggregation convention: frames_agg='median' (frame median) → tracks_agg='median' (track median)
print(results)  # prints SDR/SIR/SAR/ISR per source
📌 Why aggregation is the median: there are silent or thin-sound sections mid-track, where SDR becomes an extreme value (sometimes −∞). The mean gets dragged by outliers, so museval aggregates frame median → track median. Always use the median when measuring yourself too.
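The effect of the median is easy to see on a toy example. The sketch below uses made-up frame-wise SDR values (not real measurements) and two hypothetical helpers, `track_score` and `eval_set_score`, to reproduce the frame median → track median order with plain numpy.

```python
import numpy as np

def track_score(frame_scores) -> float:
    """Frame median for one track, ignoring nan frames."""
    return float(np.nanmedian(np.asarray(frame_scores, dtype=float)))

def eval_set_score(frames_per_track) -> float:
    """Median across tracks of the per-track frame medians."""
    return float(np.median([track_score(frames) for frames in frames_per_track]))

# Made-up frame-wise SDR values (dB) for three tracks
frames_per_track = [
    [8.0, 9.0, -np.inf, np.nan, 7.5],  # one extreme frame (-inf) and one undefined frame (nan)
    [6.0, 6.5, 7.0],
    [10.0, np.nan, 9.0],
]

print([track_score(frames) for frames in frames_per_track])  # [7.75, 6.5, 9.5]
print(eval_set_score(frames_per_track))                       # 7.75
print(np.nanmean(frames_per_track[0]))                        # -inf
```

A single −∞ frame drags the mean of the first track to −∞, while its median stays at 7.75 dB.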
Compare tools/parameters on the same footing
In the tool-selection article, I wrote "measure on your own material." This is that measurement. Apply candidate tools in turn to the same eval set and line up the SDR.
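import museval

def benchmark(separators: dict, eval_set, target: str = "vocals") -> dict[str, float]:
    """Return each tool's track-median SDR for one target, on the same material and metric."""
    results: dict[str, float] = {}
    for name, separate in separators.items():
        store = museval.EvalStore()
        for track in eval_set:
            store.add_track(museval.eval_mus_track(track, separate(track.audio)))
        sdr = store.df.query("metric == 'SDR' and target == @target")
        # Frame median per track first, then the median across tracks
        results[name] = float(sdr.groupby("track")["score"].median().median())
    return results

# demucs_sep, uvr_sep, and spleeter_sep are your own wrapper functions around each tool
scores = benchmark({"demucs": demucs_sep, "uvr_mdx": uvr_sep, "spleeter": spleeter_sep}, eval_set)
# Example: {"demucs": 8.9, "uvr_mdx": 8.4, "spleeter": 6.1} ← Demucs is best on this material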
With this, "I feel Demucs is good" turns into "Demucs is 0.5 dB higher than the runner-up on this material." Decisions then rest on numbers instead of subjectivity, and that directly makes your proposal more persuasive.
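A single median per tool can hide how consistent the difference is. As a complementary sketch, assuming you have already collected one SDR value per track for each tool, the hypothetical helper `paired_comparison` below compares two tools track by track. The track names and scores are made up.

```python
import numpy as np

def paired_comparison(sdr_a: dict, sdr_b: dict) -> dict:
    """Compare two tools track by track, given per-track SDR (dB) keyed by track name."""
    common = sorted(set(sdr_a) & set(sdr_b))  # only tracks both tools were run on
    if not common:
        raise ValueError("the two result sets share no tracks")
    deltas = np.array([sdr_a[name] - sdr_b[name] for name in common])
    return {
        "n_tracks": len(common),
        "median_delta_db": float(np.median(deltas)),  # typical advantage of tool A
        "win_rate_a": float(np.mean(deltas > 0)),     # share of tracks where A beats B
    }

# Made-up per-track SDR values for two tools on the same four tracks
tool_a = {"t1": 9.0, "t2": 8.5, "t3": 6.0, "t4": 9.5}
tool_b = {"t1": 8.0, "t2": 8.75, "t3": 5.5, "t4": 9.0}

print(paired_comparison(tool_a, tool_b))
# {'n_tracks': 4, 'median_delta_db': 0.5, 'win_rate_a': 0.75}
```

A win rate near 0.5 alongside a small median delta suggests the two tools are effectively tied on this material, whatever the headline medians say.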
The CI quality gate: stop regressions by machine
This is the heart of this piece and the part that earns me the most credit in client projects. Whenever you update the model, change a parameter, or bump a library, let CI, not a person, check automatically whether quality dropped.
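# tests/test_separation_quality.py
import pytest

# Measure SDR on a fixed eval set (small is fine: 5-10 tracks with correct stems) and
# fail the test when a score drops below its threshold, so a regression never reaches main.
QUALITY_THRESHOLDS_DB = {"vocals": 7.5, "drums": 7.0, "bass": 6.5, "other": 4.5}

@pytest.fixture(scope="module")
def sdr_by_source() -> dict[str, float]:
    # evaluate_fixed_set and load_model are your project's own helpers (not defined here).
    # Module scope runs the expensive evaluation once instead of once per source.
    return evaluate_fixed_set(model=load_model())  # {source: track-median SDR}

@pytest.mark.parametrize("source, floor", list(QUALITY_THRESHOLDS_DB.items()))
def test_separation_quality_does_not_regress(sdr_by_source, source: str, floor: float):
    sdr = sdr_by_source[source]
    assert sdr >= floor, f"{source} SDR regressed: {sdr:.2f} dB < threshold {floor} dB"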
What this guarantees in code is not just "it works" but "it hasn't gotten worse than before." The CLAUDE.md principle of preparing a verification path before implementing applies even to a domain like source separation that looks subjective. For an MVP, set the thresholds with some margin below your current scores, then raise them with each improvement (a ratchet).
💡 How to make the eval set: using the MUSDB18 test subset is the standard route. For an internal project, prepare correct stems by hand for 5–10 of the customer's representative materials and treat them as a fixed set. It can be small, because all it has to do is detect regressions.
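The ratchet idea can be sketched in a few lines. This is one possible policy, not a prescribed one: after an improvement lands, raise each floor to the newly measured score minus a safety margin, and never lower a floor. The helper name `ratchet_thresholds`, the 0.5 dB margin, and the measured values are assumptions for illustration.

```python
def ratchet_thresholds(
    current: dict[str, float],
    measured: dict[str, float],
    margin_db: float = 0.5,
) -> dict[str, float]:
    """Raise each floor to (measured - margin) when that is higher; never lower a floor."""
    updated: dict[str, float] = {}
    for source, floor in current.items():
        # A source with no new measurement keeps its existing floor
        candidate = measured.get(source, floor) - margin_db
        updated[source] = round(max(floor, candidate), 2)
    return updated

# Made-up numbers: vocals improved clearly, drums only slightly
current = {"vocals": 7.5, "drums": 7.0}
measured = {"vocals": 8.6, "drums": 7.2}

print(ratchet_thresholds(current, measured))  # {'vocals': 8.1, 'drums': 7.0}
```

The margin absorbs small run-to-run variation so that the gate fails on real regressions rather than on noise.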
How to watch real material with no reference
museval's premise is correct stems (reference). But real material flowing into production usually has no correct answer. Here, switch to realistic alternatives.
- Reference-free alternative metric: e.g., roughly estimate "the energy ratio of vocal components remaining in the instrumental track." A lot of residue = low quality, as an approximation. Not perfect, but sufficient for outlier detection.
- Human sampling: every track is impossible, but have a person spot-check a random sample or only the ones suspected of being low-score. Lining them up by suspicion via the alternative metric drastically cuts review effort.
- Production monitoring: emit the alternative metric per process to structured logs + metrics, and alert on a distribution anomaly (sudden worsening). For incorporating observability, see the production-API article.
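import numpy as np

def vocal_bleed_proxy(no_vocals: np.ndarray, vocals: np.ndarray) -> float:
    """Approximate how much vocal remains in the accompaniment. Lower is better (less residue).
    Use it as a reference-free quality proxy for real material with no correct stems."""
    # Correlation-based estimate of how much of vocals leaks into no_vocals.
    # Both arrays must have the same shape; a constant (e.g. fully silent) input yields nan.
    return float(np.corrcoef(no_vocals.flatten(), vocals.flatten())[0, 1] ** 2)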
The trade-off: the goal for real-material quality is "detecting worsening" rather than "an accurate absolute value." An alternative metric is useful enough.
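The sampling and monitoring bullets above can be sketched with the standard library alone. The two helpers below are hypothetical: `pick_for_review` returns the worst tracks by proxy value plus a random spot-check sample, and `is_batch_anomalous` flags a batch whose median proxy sits far above a baseline, measured in median absolute deviations (MAD). The 3-MAD cutoff and all numbers are assumptions to tune on your own data.

```python
import random
import statistics

def pick_for_review(proxy_by_track: dict[str, float], top_k: int, n_random: int, seed: int = 0) -> list[str]:
    """Worst `top_k` tracks by proxy (higher = more suspected bleed) plus a random sample of the rest."""
    ranked = sorted(proxy_by_track, key=proxy_by_track.get, reverse=True)
    worst, rest = ranked[:top_k], ranked[top_k:]
    rng = random.Random(seed)  # fixed seed keeps the review list reproducible
    return worst + rng.sample(rest, min(n_random, len(rest)))

def is_batch_anomalous(baseline: list[float], batch: list[float], n_mads: float = 3.0) -> bool:
    """True when the batch median is more than `n_mads` MADs above the baseline median."""
    center = statistics.median(baseline)
    mad = statistics.median(abs(x - center) for x in baseline)
    return statistics.median(batch) > center + n_mads * max(mad, 1e-9)

# Made-up proxy values per track
proxies = {"a": 0.02, "b": 0.30, "c": 0.05, "d": 0.01, "e": 0.12}
review = pick_for_review(proxies, top_k=2, n_random=1)
print(review[:2])  # ['b', 'e']

baseline = [0.02, 0.03, 0.025, 0.04, 0.03]
print(is_batch_anomalous(baseline, [0.03, 0.035, 0.02]))  # False
print(is_batch_anomalous(baseline, [0.2, 0.25, 0.03]))    # True
```

In production the boolean would feed an alert, and the review list would go to whoever does the listening checks.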
Cautions when reading SDR (traps)
- A relative value depending on data/version: SDR changes with the evaluation dataset, model version, and evaluation method (frame aggregation). Don't directly compare paper A's 9.2 dB with article B's number. Always compare with the same eval set and the same settings.
- −∞ for silent sources: in sections where that source isn't sounding mid-track, SDR goes to extreme values. Absorb it with median aggregation.
- Alignment: if the estimate and reference differ in length/phase, it comes out unfairly low. Align them in preprocessing.
- SDR isn't everything: perceived quality also depends on SIR/SAR. Look at it from more than one angle, with SDR as the main metric plus SIR/SAR per symptom.
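For the alignment caution above, here is a minimal preprocessing sketch. It assumes the only mismatch is a constant integer delay plus a length difference, and that each signal is an array of shape (n_samples, n_channels). The helper names `estimate_lag` and `align_to_reference` are made up. The brute-force lag search is fine for small `max_lag`; an FFT-based cross-correlation would be faster for large ones.

```python
import numpy as np

def estimate_lag(reference: np.ndarray, estimate: np.ndarray, max_lag: int) -> int:
    """Integer delay (in samples) of `estimate` relative to `reference`; positive means late."""
    ref = reference.mean(axis=1)  # downmix to mono for the lag search
    est = estimate.mean(axis=1)
    n = min(len(ref), len(est))
    ref, est = ref[:n], est[:n]
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            score = float(np.dot(ref[: n - lag], est[lag:]))
        else:
            score = float(np.dot(ref[-lag:], est[: n + lag]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def align_to_reference(reference: np.ndarray, estimate: np.ndarray, max_lag: int = 2048) -> np.ndarray:
    """Shift `estimate` by the estimated lag, then pad or trim it to the reference length."""
    lag = estimate_lag(reference, estimate, max_lag)
    channels = estimate.shape[1]
    if lag > 0:    # estimate is late: drop its leading samples
        estimate = estimate[lag:]
    elif lag < 0:  # estimate is early: delay it with leading zeros
        estimate = np.concatenate([np.zeros((-lag, channels), dtype=estimate.dtype), estimate])
    n = reference.shape[0]
    if estimate.shape[0] < n:
        tail = np.zeros((n - estimate.shape[0], channels), dtype=estimate.dtype)
        estimate = np.concatenate([estimate, tail])
    return estimate[:n]

# Synthetic check: delay a random stereo signal by 5 samples, then recover the delay
rng = np.random.default_rng(0)
reference = rng.standard_normal((1000, 2))
late = np.concatenate([np.zeros((5, 2)), reference])[:1000]
print(estimate_lag(reference, late, max_lag=20))  # 5
```

After alignment, stack the per-source arrays into the (n_sources, n_samples, n_channels) shape that museval.evaluate expects.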
Frequently asked questions (FAQ)
Q. How much SDR is "good"? A. There's no absolute standard (data-dependent). As a rule of thumb, for vocals in a MUSDB-family evaluation, the 7–9 dB range is high quality. But relative comparison on your own eval set is the essence.
Q. Is museval heavy? A. It computes a fair amount. A CI quality gate is fine with a small fixed set (5–10 tracks). You don't need to run all of MUSDB every time.
Q. How do I prepare correct stems? A. MUSDB18 comes with stems. An internal project can make them if there's a multitrack source (DAW tracks). If not, a few by hand.
Q. What about PESQ or SI-SDR? A. PESQ leans toward speech quality (telephony), and SI-SDR is a scale-invariant version of SDR used often recently. museval (BSSEval v4) is the de facto standard for source separation, so aligning on this first is safe.
Q. Is confirming by ear no longer needed? A. Using both is the right answer. Watch all tracks with numbers, and confirm only the suspicious ones by ear. Ear and numbers have different roles.
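To illustrate the "scale-invariant" part of the SI-SDR answer above, here is a minimal sketch of the commonly used definition for mono signals: project the estimate onto the reference, then compare the energy of that projection with the energy of what is left. This is not part of museval, and the name `si_sdr_db` and the random test signals are assumptions for illustration.

```python
import numpy as np

def si_sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB for 1-D (mono) signals."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    reference = reference - reference.mean()  # the definition assumes zero-mean signals
    estimate = estimate - estimate.mean()
    # Best-scaled copy of the reference contained in the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return float(10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps)))

rng = np.random.default_rng(0)
reference = rng.standard_normal(4000)
estimate = reference + 0.1 * rng.standard_normal(4000)

full_volume = si_sdr_db(reference, estimate)
half_volume = si_sdr_db(reference, 0.5 * estimate)
print(abs(full_volume - half_volume) < 1e-6)  # True
```

Halving the volume of the estimate leaves SI-SDR unchanged, whereas a plain energy-ratio SDR would penalize the level mismatch.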
Conclusion: quality from "subjectivity" to "numbers"
Source separation looks subjective to evaluate at first glance. But there are established metrics (SDR/ISR/SIR/SAR) and an official tool (museval), and you can stop regressions with a CI quality gate. Production quality means not stopping at "it works" but getting to "measurable, comparable, non-regressing."
- Quantify SDR with museval and aggregate frame median → track median.
- Compare tools/parameters on the same eval set and put decision-making on numbers.
- With a CI quality gate, mechanically stop regressions on changes.
- Monitor real material in production with an alternative metric + human sampling.
And this stance of "guaranteeing quality with numbers" is what builds the most trust in contract work. "It's probably good" and "SDR median 8.9 dB on the eval set, regression-gated in CI" carry entirely different weight in a proposal.
I've built the evaluation and quality gate described here into the AI video-localization platform I run in production. If you're considering a design and test foundation that guarantees AI-processing quality with numbers, for source separation or anything else, take a look at my track record and get in touch. Working as one person with generative AI, I build it out to verifiable quality.
Sources / related resources
- Evaluation tool: sigsep/sigsep-mus-eval (museval) — BSSEval v4 (SDR/ISR/SIR/SAR)
- Dataset: MUSDB18 — standard evaluation data with correct stems
- Related: tool selection / Demucs / UVR5・MDX-Net / production-API design
- SDR is a relative value depending on evaluation conditions. Always compare with the same eval set and the same settings, and don't directly compare numbers across papers.