# Measuring source-separation quality in numbers: SDR / museval and a CI quality gate

> How to evaluate source-separation quality with numbers instead of by ear. Using real code, this article covers the design that protects production quality: what BSSEval v4's SDR/ISR/SIR/SAR measure, how to compute them with museval (the official evaluation tool), how to compare tools on your own material, a CI quality gate that stops regressions on model or parameter changes, and alternative metrics for real material that has no reference.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: source separation, quality evaluation, SDR, testing, Python, MLOps, observability
- URL: https://tomodahinata.com/en/blog/music-source-separation-quality-evaluation-sdr-museval
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- Listening alone can't safeguard source-separation quality. In large batches, broken outputs inevitably slip through, and you can't judge subjectively whether a model or parameter change helped or hurt. Quantify with BSSEval v4's SDR/ISR/SIR/SAR.
- What the 4 metrics mean: SDR = overall quality, SIR = contamination from other sources (residue), SAR = unnatural artifacts, ISR = localization/spatial distortion. Compute them with museval (`pip install museval`), aggregating frame median → track median.
- museval needs the correct stems (reference). Do tool selection and regression detection on a fixed evaluation set (MUSDB18 or your own labeled material). `museval.evaluate(references, estimates)` works with plain numpy arrays.
- Stop quality drops on model/parameter changes with a CI quality gate. If the fixed eval set's SDR median falls below a threshold, the test fails, so a machine guarantees not just 'it works' but 'it hasn't gotten worse than before.'
- Real material usually has no correct stems. Monitor production with reference-free alternative metrics, such as a residual-energy ratio, plus human spot checks of low-scoring tracks. SDR is a relative value that depends on the data and version, so it isn't directly comparable across papers.

---

## The goal of this article

Run source separation with [Demucs](/blog/demucs-v4-music-source-separation-production-guide) or [UVR5](/blog/uvr5-mdx-net-vocal-separation-production-guide), and the next question is inevitable: **"So how good is the quality?"** If your only answer is "it sounds good to my ear," you'll get stuck both in production and in a project proposal.

This article shows, in real code, how to **measure source-separation quality with objective numbers.** By the end, you should be able to:

1. Explain to others what the 4 metrics **SDR/ISR/SIR/SAR** measure.
2. Quantify quality with **museval** and **compare tools/parameters on the same footing.**
3. Automatically catch quality drops from a model or parameter change with a **CI quality gate.**

> **About the author (credibility disclosure)**: I have **single-handedly designed, implemented, and operated in production an AI video-localization platform** that fully automates multilingual localization from a single video upload. In its first stage, source separation, **back when I checked quality by ear alone, broken clips slipped into production** and dragged down the downstream transcription and dubbing. The evaluation design here is a record of how I rebuilt it to "watch quality with numbers" after that lesson.

---

## 30-second summary

| Aspect | Conclusion |
| --- | --- |
| **Why quantify** | By ear alone, you miss broken outputs in large batches, and you can't judge subjectively whether a model or parameter change helped or hurt |
| **Metrics to use** | **BSSEval v4**: SDR (overall) / SIR (contamination, residue) / SAR (artifacts) / ISR (spatial distortion) |
| **Tool** | **museval** (the official evaluation tool). `pip install museval`. `museval.evaluate(references, estimates)` |
| **Aggregation** | Compute per frame → **frame median → track median** (robust to outliers) |
| **Premise** | **Correct stems (reference) needed.** Measure on a fixed eval set (MUSDB18 or your own labels) |
| **CI quality gate** | On a change, measure the eval set's SDR median, and **fail the test if it falls below a threshold** |
| **Real material (no reference)** | Reference-free alternative metrics + human sampling of low-score tracks |
| **Caution** | SDR is a relative value depending on data/version/evaluation method. **Don't compare directly across papers** |

The metrics are explained in [this chapter](#what-the-4-metrics-measure).

---

## Why "the ear" alone isn't enough

Leave source-separation quality checking to the human ear and it always breaks down in three situations.

- **Scale**: in a batch that processes thousands of tracks a day, no one can listen to every track. **Broken clips always slip through.**
- **Regression**: you updated the model, changed `shifts`, or switched tools, and **you can't judge subjectively whether it got better than before.**
- **Comparison**: is Demucs or UVR5 better on your material? Without **the same yardstick**, the discussion becomes an argument nobody can win.

The solution is to **make quality a reproducible number.** That's BSSEval v4 and museval.

---

## What the 4 metrics measure

The standard metrics for source separation are the 4 metrics of **BSSEval v4**, which `museval` computes. All are in **dB (higher is better).** Restated intuitively:

| Metric | Full name | What it measures (intuition) |
| --- | --- | --- |
| **SDR** | Source-to-Distortion Ratio | **Overall quality.** The strength of the target sound relative to all distortion. The rough "good or bad" is this one |
| **SIR** | Source-to-Interference Ratio | **Contamination from other sources.** Whether drums leak into the vocals; higher means less **residue** |
| **SAR** | Source-to-Artifacts Ratio | **Artifacts.** Higher means less unnatural noise or metallic sound introduced by the separation |
| **ISR** | Image-to-Spatial-distortion Ratio | **Spatial (localization) distortion.** Whether the stereo left/right image is preserved; higher means less distortion |

**How to choose between them**:

- For overall good/bad, **SDR** as the main metric.
- If "**vocals remain in the instrumental**" is the problem, look at **SIR** (the contamination metric).
- If "**strange noise rides on the voice**," look at **SAR** (the artifact metric).

In other words, **"overall is SDR, isolating the symptom is SIR/SAR."** This decomposes "somehow bad" into "is there a lot of contamination, or a lot of artifacts."
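
As a rough intuition for what a dB ratio like SDR expresses, here is a minimal sketch in plain numpy. It is a simplified energy ratio, not the BSSEval v4 computation that museval performs (BSSEval additionally splits the error into spatial, interference, and artifact parts). The function name `simple_sdr_db` and the synthetic signals are assumptions for illustration only.

```python
import numpy as np

def simple_sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Energy of the reference over energy of the error, in dB.

    Simplified illustration only; this is NOT the BSSEval v4 SDR from museval.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero and log(0) for silent or perfect inputs
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))

# Synthetic stand-ins: white noise playing the roles of "vocals" and "drums"
rng = np.random.default_rng(0)
vocals = rng.standard_normal(44100)
drums = rng.standard_normal(44100)

clean_estimate = vocals + 0.1 * drums   # little drum leakage
leaky_estimate = vocals + 0.5 * drums   # a lot of drum leakage

print("clean:", simple_sdr_db(vocals, clean_estimate), "dB")
print("leaky:", simple_sdr_db(vocals, leaky_estimate), "dB")
```

The estimate with less leakage scores higher, which is the "higher is better" reading used throughout this article.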

---

## Measure with museval (minimal implementation)

`museval` is the **official evaluation tool** of the SiSEC MUS campaign (the predecessor of the Music Demixing Challenge) and is designed to be used with the MUSDB18 dataset. The core function `museval.evaluate` works **with plain numpy arrays**, so it does not depend on musdb.

```bash
pip install museval
```

```python
import museval
import numpy as np

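# references / estimates both have shape (n_sources, n_samples, n_channels)
# references = correct stems (ground truth), estimates = the model's separation output
# Synthetic placeholders so the snippet runs; replace them with your real audio arrays.
rng = np.random.default_rng(0)
references = rng.standard_normal((2, 5 * 44100, 2))
estimates = references + 0.1 * rng.standard_normal(references.shape)

SDR, ISR, SIR, SAR = museval.evaluate(references, estimates)

# Each return value is an array of shape (n_sources, n_frames).
# Aggregate a track's score with the median over frames.
sdr_per_source = np.nanmedian(SDR, axis=1)   # median that ignores NaN frames
print("vocals SDR:", sdr_per_source[0], "dB")  # assumes source index 0 is vocals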
```

If using MUSDB18, the track-level high-level API is convenient.

```python
import musdb
import museval

mus = musdb.DB(root="/path/to/musdb18", subsets="test")
results = museval.EvalStore()  # store that aggregates results across tracks

for track in mus:
    # run_my_separation is your own function (not shown here)
    estimates = run_my_separation(track.audio)   # {"vocals": ndarray, "accompaniment": ndarray}
    scores = museval.eval_mus_track(track, estimates)  # evaluate one track
    results.add_track(scores)

# Aggregation convention: frames_agg='median' (frame median) → tracks_agg='median' (track median)
print(results)  # prints SDR/SIR/SAR/ISR per source
```

> 📌 **Why aggregation is the median**: there are silent or thin-sound sections mid-track, where SDR becomes an extreme value (sometimes −∞). **The mean gets dragged by outliers**, so museval aggregates **frame median → track median.** Always use the median when measuring yourself too.
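
The effect of median aggregation can be seen in a small numpy sketch. The per-frame values below are made up for illustration (NaN stands for a frame with no usable score), and `track_scores` is a hypothetical helper, not part of museval.

```python
import numpy as np

# Made-up per-frame SDR values in dB; track_a contains one extreme frame and one NaN frame
frame_sdr = {
    "track_a": np.array([8.0, 9.0, np.nan, -30.0, 8.5]),
    "track_b": np.array([7.0, 7.5, 8.0]),
    "track_c": np.array([6.0, np.nan, 6.5, 7.0]),
}

def track_scores(frames_by_track: dict) -> dict:
    """Step 1: collapse each track to the median of its frames, ignoring NaN."""
    return {name: float(np.nanmedian(frames)) for name, frames in frames_by_track.items()}

per_track = track_scores(frame_sdr)

# Step 2: collapse the eval set to the median of the per-track scores
dataset_score = float(np.median(list(per_track.values())))

# For contrast: the mean of track_a is dragged down by the single -30 dB frame
mean_of_track_a = float(np.nanmean(frame_sdr["track_a"]))

print(per_track)        # {'track_a': 8.25, 'track_b': 7.5, 'track_c': 6.5}
print(dataset_score)    # 7.5
print(mean_of_track_a)  # -1.125
```

One bad frame moves the mean of `track_a` from around 8 dB to below 0 dB, while the median barely reacts.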

---

## Compare tools/parameters on the same footing

In the [tool-selection article](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter), I wrote "**measure on your own material.**" This is that measurement. **Apply candidate tools in turn to the same eval set and line up the SDR.**

```python
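import museval

def benchmark(separators: dict, eval_set) -> dict[str, float]:
    """Return each tool's track-median SDR, compared fairly on the same material and metric."""
    results: dict[str, float] = {}
    for name, separate in separators.items():
        store = museval.EvalStore()
        for track in eval_set:
            store.add_track(museval.eval_mus_track(track, separate(track.audio)))
        sdr = store.df[store.df["metric"] == "SDR"]
        # Frame median per (track, target) first, then the median across tracks (targets pooled)
        per_track = sdr.groupby(["track", "target"])["score"].median()
        results[name] = float(per_track.median())
    return results

# demucs_sep / uvr_sep / spleeter_sep are your own wrappers: audio in, dict of stems out
scores = benchmark({"demucs": demucs_sep, "uvr_mdx": uvr_sep, "spleeter": spleeter_sep}, eval_set)
# Illustrative values: {"demucs": 8.9, "uvr_mdx": 8.4, "spleeter": 6.1} <- Demucs is best on this material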
```

With this, "I feel Demucs is good" turns into "Demucs is +0.5 dB higher on this material." **Decisions rest on numbers instead of subjectivity**, and that directly makes your proposal more persuasive.

---

## The CI quality gate: stop regressions by machine

This is the **heart** of this article and the point clients **value most** in my projects. Whenever you update the model, change a parameter, or bump a library, have **CI check automatically**, rather than a person, whether **quality dropped.**

```python
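# tests/test_separation_quality.py
import pytest

# Measure SDR on a fixed eval set (small is fine: 5-10 tracks with correct stems) and fail
# the test when it drops below the threshold, so a "worse than before" change never reaches main.
QUALITY_THRESHOLDS_DB = {"vocals": 7.5, "drums": 7.0, "bass": 6.5, "other": 4.5}

@pytest.fixture(scope="module")
def sdr_by_source() -> dict[str, float]:
    # evaluate_fixed_set / load_model are project-specific helpers (not shown here).
    # Module scope: evaluate once, not once per parametrized case.
    return evaluate_fixed_set(model=load_model())   # {source: track-median SDR}

@pytest.mark.parametrize("source, floor", QUALITY_THRESHOLDS_DB.items())
def test_separation_quality_does_not_regress(sdr_by_source, source: str, floor: float):
    sdr = sdr_by_source[source]
    assert sdr >= floor, f"{source} SDR regressed: {sdr:.2f} dB < threshold {floor} dB"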
```

What this achieves is **guaranteeing in code not just "it works" but "it hasn't gotten worse than before."** The CLAUDE.md principle of **preparing a verification path before implementing** applies even to a domain like source separation that looks subjective. For the MVP, set the thresholds with some margin, then raise them with each improvement (a ratchet).

> 💡 **How to make the eval set**: using the MUSDB18 test subset is the standard route. For an internal project, **prepare correct stems by hand for 5–10 of the customer's representative materials** and use them as a fixed set. It can be small, because **detecting regressions is all it needs to do.**
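
The idea of raising thresholds as quality improves can be sketched as a small helper that proposes new floors after a successful change. The function name `ratchet_thresholds`, the 0.5 dB margin, and the measured values are assumptions for illustration; in practice you would review the proposed floors before committing them.

```python
def ratchet_thresholds(
    current_floors: dict[str, float],
    measured_sdr: dict[str, float],
    margin_db: float = 0.5,
) -> dict[str, float]:
    """Propose new per-source floors.

    A floor is never lowered. It is raised to (measured - margin) when that is
    higher than the current floor, so normal run-to-run noise does not fail CI.
    """
    new_floors = {}
    for source, floor in current_floors.items():
        candidate = measured_sdr[source] - margin_db
        new_floors[source] = round(max(floor, candidate), 2)
    return new_floors

# Hypothetical numbers: vocals improved a lot, drums only slightly
floors = {"vocals": 7.5, "drums": 7.0}
measured = {"vocals": 8.6, "drums": 7.2}
print(ratchet_thresholds(floors, measured))  # {'vocals': 8.1, 'drums': 7.0}
```

The vocals floor moves up because the new measurement clears it with room to spare, while the drums floor stays where it is because 7.2 dB minus the margin is still below 7.0 dB.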

---

## How to watch real material with no reference

museval's premise is **correct stems (reference).** But **real material flowing into production usually has no correct answer.** Here, switch to realistic alternatives.

- **Reference-free alternative metric**: for example, roughly estimate "**the energy ratio of vocal components remaining in the instrumental track.**" As an approximation, a lot of residue means low quality. It's not perfect, but it's **sufficient for outlier detection.**
- **Human sampling**: listening to every track is impossible, so have a person spot-check **a random sample, or only the tracks suspected of scoring low.** Ranking tracks by the alternative metric drastically cuts review effort.
- **Production monitoring**: emit the alternative metric for each run to **structured logs + metrics**, and alert on a distribution anomaly (sudden worsening). For how to build in observability, see [the production-API article](/blog/music-source-separation-production-api-gpu-worker-queue).

```python
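import numpy as np

def vocal_bleed_proxy(no_vocals: np.ndarray, vocals: np.ndarray) -> float:
    """Approximate how much vocal remains in the accompaniment. Lower is better (less residue).
    Use it as a reference-free quality proxy for real material that has no correct stems.
    Returns NaN when either signal is constant (for example, pure silence)."""
    if no_vocals.shape != vocals.shape:
        raise ValueError("no_vocals and vocals must have the same shape")
    # Correlation-based estimate of how much of `vocals` leaks into `no_vocals`
    return float(np.corrcoef(no_vocals.ravel(), vocals.ravel())[0, 1] ** 2)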
```

**The trade-off**: the goal for real-material quality is **"detecting worsening" rather than "an accurate absolute value."** An alternative metric is useful enough.
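
To pick which tracks a person should listen to, one option is to flag proxy scores that are unusually high for the batch. The sketch below uses a robust z-score based on the median and the median absolute deviation (MAD); the function name `flag_suspicious`, the 3.5 cutoff, and the sample scores are assumptions for illustration, not values from my production system.

```python
import numpy as np

def flag_suspicious(scores: dict[str, float], z_threshold: float = 3.5) -> list[str]:
    """Return track IDs whose proxy score is unusually high for this batch.

    Uses the modified z-score: 0.6745 * (x - median) / MAD.
    Results are ordered from most to least suspicious.
    """
    ids = list(scores)
    values = np.array([scores[i] for i in ids], dtype=np.float64)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return []  # all scores (nearly) identical: nothing stands out
    robust_z = 0.6745 * (values - median) / mad
    order = np.argsort(-robust_z)  # highest z first
    return [ids[i] for i in order if robust_z[i] > z_threshold]

# Hypothetical bleed-proxy scores for one batch (lower is better)
batch = {"a": 0.010, "b": 0.012, "c": 0.011, "d": 0.013, "e": 0.200}
print(flag_suspicious(batch))  # ['e']
```

Because the median and MAD ignore the outlier itself, a single broken track does not hide behind an inflated average, which matches the goal of detecting worsening rather than measuring an absolute value.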

---

## Cautions when reading SDR (traps)

- **A relative value depending on data/version**: SDR changes with the evaluation dataset, model version, and evaluation method (frame aggregation). **Don't directly compare paper A's 9.2 dB with article B's number.** Always compare with **the same eval set and the same settings.**
- **−∞ for silent sources**: in sections where that source isn't sounding mid-track, SDR goes to extreme values. Absorb it with **median aggregation.**
- **Alignment**: if the estimate and reference differ in length or phase, the score comes out unfairly low. **Align them in preprocessing.**
- **SDR isn't everything**: perception is also affected by SIR/SAR. Look at quality from several angles with **SDR as the main metric + SIR/SAR per symptom.**
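
For the alignment trap, a minimal preprocessing sketch could look like the following. It assumes arrays shaped `(n_samples, n_channels)` and a constant offset of whole samples; `estimate_lag` and `align` are hypothetical helper names, and the sketch does not handle sample-rate mismatch or polarity inversion.

```python
import numpy as np

def estimate_lag(reference: np.ndarray, estimate: np.ndarray, max_samples: int = 8192) -> int:
    """Estimate how many samples `estimate` is delayed relative to `reference`.

    Cross-correlates the mono downmix of the first `max_samples` samples.
    Positive: the estimate starts late. Negative: the estimate starts early.
    """
    ref_mono = reference[:max_samples].mean(axis=1)
    est_mono = estimate[:max_samples].mean(axis=1)
    xcorr = np.correlate(est_mono, ref_mono, mode="full")
    return int(np.argmax(xcorr)) - (len(ref_mono) - 1)

def align(reference: np.ndarray, estimate: np.ndarray):
    """Shift the estimate onto the reference timeline, then crop both to a common length."""
    lag = estimate_lag(reference, estimate)
    if lag > 0:
        estimate = estimate[lag:]                           # drop the extra leading samples
    elif lag < 0:
        estimate = np.pad(estimate, ((-lag, 0), (0, 0)))    # pad the missing start with silence
    n = min(len(reference), len(estimate))
    return reference[:n], estimate[:n]

# Synthetic check: an "estimate" that is the reference delayed by 5 samples
rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 2))
late = np.vstack([np.zeros((5, 2)), ref])
aligned_ref, aligned_est = align(ref, late)
print(aligned_ref.shape, aligned_est.shape, np.allclose(aligned_ref, aligned_est))
```

The reference timeline is left untouched on purpose, so that several stems of the same track stay aligned with each other when they are stacked for evaluation.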

---

## Frequently asked questions (FAQ)

**Q. How much SDR is "good"?**
A. There's no absolute standard (data-dependent). As a rule of thumb, for vocals in a MUSDB-family evaluation, the **7–9 dB range is high quality.** But **relative comparison on your own eval set** is the essence.

**Q. Is museval heavy?**
A. It is fairly compute-heavy. A CI quality gate is fine with **a small fixed set (5–10 tracks).** You don't need to run all of MUSDB every time.

**Q. How do I prepare correct stems?**
A. MUSDB18 comes with stems. An internal project can make them if there's a **multitrack source (DAW tracks).** If not, make a few by hand.

**Q. What about PESQ or SI-SDR?**
A. PESQ leans toward speech quality (telephony), and SI-SDR is a scale-invariant version of SDR used often recently. **museval (BSSEval v4) is the de facto standard for source separation**, so aligning on this first is safe.

**Q. Is confirming by ear no longer needed?**
A. **Using both is the right answer.** Watch all tracks with numbers, and **confirm only the suspicious ones by ear.** Ear and numbers have different roles.
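
As a footnote to the SI-SDR question above, the "scale-invariant" part can be shown in a few lines of numpy. This is a minimal sketch of the commonly used definition (project the estimate onto the reference, then compare target energy to residual energy) for mono signals; `si_sdr_db` is a name chosen here for illustration, and it is not what museval computes.

```python
import numpy as np

def si_sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Scale-invariant SDR in dB for two 1-D signals of equal length."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    estimate = np.asarray(estimate, dtype=np.float64).ravel()
    # Remove DC offset, as is common for this metric
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference so that gain differences are not penalized
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return float(10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)))

rng = np.random.default_rng(1)
source = rng.standard_normal(2000)
estimate = source + 0.1 * rng.standard_normal(2000)

print(si_sdr_db(source, estimate))
print(si_sdr_db(source, 0.5 * estimate))  # same score: halving the volume is not penalized
```

A plain energy-ratio SDR would drop when the estimate is merely quieter, which is one reason numbers from the two metrics should not be mixed in the same comparison.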

---

## Conclusion: quality from "subjectivity" to "numbers"

Source separation looks, at a glance, subjective to evaluate. But there are **established metrics (SDR/ISR/SIR/SAR)**, the official tool **museval**, and a **CI quality gate** that stops regressions. Production quality means not stopping at "it works" but getting to **"measurable, comparable, non-regressing."**

1. **Quantify SDR with museval** and aggregate frame median → track median.
2. **Compare tools/parameters on the same eval set** and put decision-making on numbers.
3. With a **CI quality gate**, mechanically stop regressions on changes.
4. **Monitor real material in production with an alternative metric + human sampling.**

And **this stance of "guaranteeing quality with numbers" is what earns the most trust in contracted work.** "It's probably good" and "SDR median 8.9 dB on the eval set, regression-gated in CI" carry entirely different weight in a proposal.

> I've built the evaluation and quality gate described here into the **AI video-localization platform I actually run in production.** If you're considering a design and test foundation that guarantees AI-processing quality with numbers, for source separation or beyond, take a look at my [track record](/case-studies/ai-video-localization-lipsync) and get in touch. As **one person × generative AI**, I build it out to verifiable quality.

---

## Sources / related resources

- **Evaluation tool**: [sigsep/sigsep-mus-eval (museval)](https://github.com/sigsep/sigsep-mus-eval) — BSSEval v4 (SDR/ISR/SIR/SAR)
- **Dataset**: [MUSDB18](https://sigsep.github.io/datasets/musdb.html) — standard evaluation data with correct stems
- Related: [tool selection](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter) / [Demucs](/blog/demucs-v4-music-source-separation-production-guide) / [UVR5・MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) / [production-API design](/blog/music-source-separation-production-api-gpu-worker-queue)

* SDR is a relative value depending on evaluation conditions. **Always compare with the same eval set and the same settings**, and don't directly compare numbers across papers.