# Is real-time source separation possible? The design and limits of low latency (the reality of streaming processing)

> You want to separate sources (vocals vs. accompaniment) in real time and at low latency. This article gives an honest account of how feasible that is, based on a breakdown of latency and the characteristics of each model. It covers why MDX-Net/Demucs/RoFormer are inherently batch-oriented, the chunk/streaming designs that get you closer and their quality trade-offs, how the problem differs from noise suppression, and how to judge whether you really need real-time at all.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: source separation, real-time, low latency, streaming, audio processing, Python, architecture design, AI audio
- URL: https://tomodahinata.com/en/blog/realtime-low-latency-source-separation-design-limits
- Category: Audio source separation & preprocessing
- Pillar guide: https://tomodahinata.com/en/blog/music-source-separation-tool-selection-demucs-uvr-spleeter

## Key points

- Major models such as MDX-Net/Demucs/RoFormer are inherently batch-oriented. They are non-causal: segment + overlap processing references future samples, which makes them unsuited to true real-time.
- Latency = algorithmic delay (STFT window + lookahead) + compute delay (model inference) + buffering. All three trade off against the 'look at the future for quality' design.
- Ways to get close are chunk/streaming processing, small models, and low lookahead. But the cost always shows up somewhere: boundary artifacts, lower quality, or higher latency.
- Noise suppression (speech enhancement, e.g. RNNoise) has mature real-time-oriented methods. Real-time full music separation, by contrast, remains unsolved with standard models.
- Many requirements are met by 'near-real-time' or 'batch + throughput optimization.' The right first design decision is to determine whether you truly need zero latency.

---

## The goal of this article

"I want to **remove vocals in real time during a stream**," "I want to **extract only the voice live**": anyone who works with source separation hears these requests sooner or later. And in most cases, the answer is **"with standard models, true real-time is hard."**

But rather than stopping at "hard," this article shows, in design terms, **how close you can get, why there are limits, and whether you really need real-time in the first place.** When you finish reading, you'll be able to:

1. Explain **why MDX-Net / Demucs / RoFormer are batch-oriented** from the breakdown of latency.
2. Understand **the design that gets close to low latency with chunk/streaming processing** and its **quality trade-offs.**
3. Tell whether real-time is actually needed, and make a call that **doesn't collapse under excessive requirements.**

> **About the author (for credibility)**: I have **single-handedly designed, implemented, and run in production an AI audio/video platform whose first stage is source separation.** In production I was asked for "real time" many times, but once the requirements were unpacked, **most were satisfied by near-real-time or batch.** This article organizes what you need to **design honestly against the requirements** rather than being swept along by the "real-time" trend. For scaling batch processing in production, see the [GPU-worker-platform article](/blog/music-source-separation-production-api-gpu-worker-queue).

---

## A 30-second summary (the conclusion first)

| Issue | Conclusion |
| --- | --- |
| **Real-time with standard models?** | Hard. MDX-Net/Demucs/RoFormer are **non-causal (reference the future), batch-oriented** |
| **Why hard** | Latency = STFT window + lookahead (algorithmic) + model inference (compute) + buffer |
| **Ways to get close** | Chunk/streaming, small models, low lookahead, overlap-add |
| **The price** | **Boundary artifacts, quality drop, increased latency** (always appear) |
| **Noise suppression is separate** | Speech enhancement (e.g. RNNoise) has **mature real-time-oriented methods.** It is a different problem from music separation |
| **The realistic answer** | Many requirements are met by **near-real-time** or **batch + throughput optimization** |
| **The right question** | Rather than "can it be real-time," **"do you truly need zero latency"** |

---

## Why standard models aren't suited to real-time

[MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide), [Demucs](/blog/demucs-v4-music-source-separation-production-guide), and [BS-RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide) are all designed to **look at "future samples" for quality.** This is fundamentally incompatible with low latency.

- **Segment + overlap processing**: cut audio into fixed-length chunks and **overlap before and after** to erase the seams. Overlapping = **can't finalize until subsequent samples are gathered** = lookahead delay occurs.
- **Non-causal**: these models estimate the present using not only past but **future context too.** Real-time (causal) processing presupposes "not looking at the future," so it structurally doesn't mesh.
- **Large models**: RoFormer in particular is heavy, so **inference of one chunk itself takes time** ([the reality of speed/VRAM](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)).

In other words, these are **batch models optimized to "solve the whole file (or a long segment) together, quality-first."** Real-time isn't in their design goals.
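
As a toy illustration of what "non-causal" means (a moving average, not a separation model; the function names below are made up for this sketch), compare a filter that only looks back with one that also looks ahead. The centered version cannot finalize output `t` until `half` future samples have arrived, which is the same kind of lookahead wait described above.

```python
def causal_average(x, width):
    """Trailing average: output[t] uses only x[t-width+1 .. t] (no future samples)."""
    out = []
    for t in range(len(x)):
        past = x[max(0, t - width + 1): t + 1]
        out.append(sum(past) / len(past))
    return out


def centered_average(x, half):
    """Centered average: output[t] uses x[t-half .. t+half], so it needs `half` future samples."""
    out = []
    for t in range(len(x)):
        context = x[max(0, t - half): t + half + 1]
        out.append(sum(context) / len(context))
    return out


def newest_sample_needed(t, future_needed):
    """Index of the newest input sample that must arrive before output[t] is final."""
    return t + future_needed


quiet = [0.0, 0.0, 0.0, 0.0]
late_spike = [0.0, 0.0, 0.0, 5.0]   # differs only at index 3

# Output at t=2: the causal filter is unaffected by what happens at t=3,
# while the centered filter changes, so it must wait for sample 3.
print(causal_average(quiet, 2)[2] == causal_average(late_spike, 2)[2])
print(centered_average(quiet, 1)[2] == centered_average(late_spike, 1)[2])
print(newest_sample_needed(2, 0), newest_sample_needed(2, 1))
```

A separation model that draws on future context behaves like the second function at a much larger scale: the quality comes from the future samples, and the wait for them is the price.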

---

## The breakdown of latency: what produces delay

To talk about "lowering latency," you first need to break delay down into its components. The delay of real-time audio processing is roughly the sum of three terms.

```text
total delay = ① algorithmic delay  (STFT window length + lookahead/overlap)
            + ② compute delay      (model inference time for one chunk)
            + ③ buffering delay    (input/output block size)
```

- **① Algorithmic delay**: frequency-domain processing is inherently delayed by the STFT window length. Plus the wait for the future from overlap. **The more you raise quality (longer window, more overlap), the more delay increases** — this is the core of the trade-off.
- **② Compute delay**: the time to pass a chunk through the model. **Make the chunk smaller and it drops, but context shrinks and quality falls.** Also depends on GPU performance.
- **③ Buffering delay**: the block unit of audio I/O. A smaller block lowers this delay but increases per-block overhead.

How much delay still counts as "real-time" depends on the use case (conversational response, monitoring, effects, and so on each have different demands). **"Whether ①+②+③ falls below that limit"** is the feasibility criterion.
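
The sum above can be written down as a small budget calculator. This is a minimal sketch: the class name, field names, and every number in the example are assumptions chosen for illustration, not measurements of any model.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    sample_rate: int      # Hz
    stft_window: int      # STFT window length in samples
    lookahead: int        # future samples waited for (overlap / context)
    inference_ms: float   # measured model inference time for one chunk
    io_block: int         # audio I/O block size in samples

    def algorithmic_ms(self) -> float:
        # Term 1: fixed by the signal-processing design, not by hardware.
        return 1000.0 * (self.stft_window + self.lookahead) / self.sample_rate

    def buffering_ms(self) -> float:
        # Term 3: one I/O block must fill before it can be handed over.
        return 1000.0 * self.io_block / self.sample_rate

    def total_ms(self) -> float:
        # Term 1 + term 2 + term 3.
        return self.algorithmic_ms() + self.inference_ms + self.buffering_ms()

    def fits(self, limit_ms: float) -> bool:
        return self.total_ms() <= limit_ms


# Illustrative numbers only (assumptions, not measurements).
budget = LatencyBudget(sample_rate=48_000, stft_window=4_800,
                       lookahead=24_000, inference_ms=150.0, io_block=960)
faster_gpu = LatencyBudget(sample_rate=48_000, stft_window=4_800,
                           lookahead=24_000, inference_ms=75.0, io_block=960)

print(budget.algorithmic_ms(), budget.total_ms())
print(faster_gpu.algorithmic_ms(), faster_gpu.total_ms())
```

In this example, halving the inference time lowers only the compute term; the algorithmic term is identical in both budgets. To bring that term down, you have to shorten the window or the lookahead, which is where quality is paid.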

---

## The design to get close to low latency (and the price that always appears)

Even if true zero latency is impossible, there are designs to **get close to near-real-time.** But it's **always in exchange for quality.**

### Chunk/streaming processing

Rather than the whole file, **process small chunks in order of arrival** and join them with overlap-add.

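```python
# streaming_approx.py: skeleton of a streaming approximation (conceptual; quality trade-offs apply)
from collections import deque
from itertools import islice


class StreamingSeparator:
    """Approximation that separates chunks in order of arrival.

    Note: standard source-separation models are non-causal, so this is only an
    approximation; chunk-boundary artifacts and lookahead delay remain in principle.
    """

    def __init__(self, separate_fn, chunk_samples: int, lookahead: int):
        # Assumption: separate_fn returns one output sample per input sample.
        self._separate = separate_fn          # function that separates one window
        self._chunk = chunk_samples
        self._lookahead = lookahead           # wait for a little of the future, for quality
        self._buf: deque = deque()

    def push(self, samples) -> "list | None":
        self._buf.extend(samples)
        need = self._chunk + self._lookahead  # wait until chunk + lookahead have arrived
        if len(self._buf) < need:
            return None                        # cannot finalize yet (= the source of latency)
        # Pass chunk + future context to the model without consuming the lookahead.
        window = list(islice(self._buf, need))
        for _ in range(self._chunk):
            self._buf.popleft()                # consume only the chunk; lookahead is reused next time
        # Emit only the chunk part; boundaries are smoothed with overlap-add (separately).
        return list(self._separate(window))[: self._chunk]
```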

> ⚠️ This implementation is a **conceptual sketch.** Since standard models are non-causal, chunking still leaves **boundary artifacts** and **lookahead delay** in principle. Make `chunk_samples` smaller and delay decreases but quality falls; increase `lookahead` and quality rises but delay increases. **There's no universal setting.**
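
The overlap-add step that the skeleton leaves out can be sketched as follows. This is one simple variant (a linear crossfade over the overlapping region), written in plain Python for readability; the class name `CrossfadeJoiner` and the assumption that consecutive chunk outputs overlap by a fixed number of samples are mine, not part of any library.

```python
class CrossfadeJoiner:
    """Joins consecutive chunk outputs that overlap by `overlap` samples."""

    def __init__(self, overlap: int):
        self._overlap = overlap
        self._tail = None  # held-back overlap region of the previous chunk

    def add(self, chunk):
        """Return the samples that can be finalized after receiving `chunk`."""
        n = self._overlap
        if len(chunk) < 2 * n:
            raise ValueError("chunk must be at least twice the overlap length")
        if n == 0:
            return list(chunk)                 # no overlap: plain concatenation
        if self._tail is None:
            head = list(chunk[:n])             # first chunk: nothing to blend with
        else:
            # Linear crossfade: the weight moves from the previous chunk to the new one.
            head = [
                p + (i + 1) / (n + 1) * (c - p)
                for i, (p, c) in enumerate(zip(self._tail, chunk[:n]))
            ]
        self._tail = list(chunk[len(chunk) - n:])  # held back until the next chunk arrives
        return head + list(chunk[n:len(chunk) - n])

    def flush(self):
        """Emit the held-back tail at the end of the stream."""
        tail, self._tail = self._tail or [], None
        return tail


# Chunks of 6 samples with an overlap of 2 (hop of 4), using an identity "model".
signal = [float(i) for i in range(14)]
joiner = CrossfadeJoiner(overlap=2)
joined = []
for start in (0, 4, 8):
    joined += joiner.add(signal[start:start + 6])
joined += joiner.flush()
print(joined == signal)
```

Note that the last `overlap` samples of every chunk are held back until the next chunk arrives. Smoother seams therefore cost additional delay, which is the same trade-off again.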

### Other moves (all trade-offs)

- **Use a small, lightweight model** ([MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) family, etc.): lowers compute delay. Quality is inferior to the best model.
- **Reduce overlap**: lowers lookahead delay. Seam noise increases.
- **Make the GPU stronger**: only the compute delay can be lowered (algorithmic delay doesn't change).
- **Adopt a causal / low-latency-specialized model**: the commonly distributed checkpoints are mostly non-causal. You need a separate model built for low latency, and you accept a compromise on quality.
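
To see how much a smaller model or a stronger GPU actually buys, it helps to measure compute delay as a real-time factor (inference time divided by chunk duration). The sketch below assumes a hypothetical `separate_fn` callable and uses a dummy stand-in; if inference runs asynchronously on a GPU, you would also need to synchronize before stopping the timer (for example with `torch.cuda.synchronize()` in PyTorch).

```python
import time


def real_time_factor(separate_fn, chunk, sample_rate, repeats=5):
    """Return (median inference seconds) / (chunk duration in seconds).

    A value below 1 means inference keeps up with incoming audio. It says
    nothing about algorithmic delay, which is set by window and lookahead.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        separate_fn(chunk)
        durations.append(time.perf_counter() - start)
    durations.sort()
    median = durations[len(durations) // 2]
    return median / (len(chunk) / sample_rate)


def dummy_separate(chunk):
    # Stand-in for a real model call (an assumption for this sketch).
    return [s * 0.5 for s in chunk]


one_second = [0.0] * 48_000
rtf = real_time_factor(dummy_separate, one_second, 48_000)
print(f"real-time factor: {rtf:.4f}")
```

A real-time factor comfortably below 1 is necessary for streaming but not sufficient: the window and lookahead wait still has to be added on top.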

---

## Noise suppression (speech enhancement) is a different world

It's easy to confuse, so let me clearly separate them. **"Removing noise in real time" and "separating vocals from music in real time" are different problems.**

- **Noise suppression / speech enhancement** (removing meeting noise, etc.) has **mature real-time-oriented methods.** For example, [RNNoise](https://github.com/xiph/rnnoise) (Xiph) is widely used as lightweight, low-latency real-time noise suppression. The task is "voice vs. noise," which is easy to design causally and with a lightweight model.
- **Music source separation (demixing)** is the task of separating **musically overlapping components** — voice, drums, bass, accompaniment — and quality requires rich context (= the future). So standard models are batch-oriented, and it's accurate to understand that **true real-time music separation is unsolved with standard models.**

If "I want to remove only the vocals in real time" is **actually "I want to remove noise,"** you should look at the speech-enhancement domain — **rephrasing the requirement leads to the right tech selection.**

---

## The most important question: do you truly need real-time?

This is where the designer's skill shows. Most "in real time" requests turn out to be **met by a different solution once you unpack the requirements.**

| The apparent request | The true requirement | The realistic answer |
| --- | --- | --- |
| "Remove voice in a live stream" | A few seconds of delay is often acceptable | **Near-real-time** (small chunk + small model) |
| "Process uploaded video" | Doesn't need to be instant | **Batch + throughput optimization** |
| "Remove meeting noise" | Not music separation but noise suppression | **Speech enhancement** (RNNoise family) |
| "Process thousands of files fast" | Not one item's delay but total processing time | **Parallel batch** ([GPU-worker platform](/blog/music-source-separation-production-api-gpu-worker-queue)) |

> 💡 **A design principle**: don't jump at "can it be real-time?"; first ask **"how many seconds of delay is acceptable" and "is there truly a requirement that breaks without zero latency."** Many are met by near-real-time or batch, and that's more stable in quality and cost. An excessive low-latency requirement sacrifices both quality and cost.
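
The triage in the table can be expressed as a tiny decision function. This is only a sketch: the function name, the input flags, and especially the `near_real_time_floor_s` threshold (1 second here) are placeholder assumptions that you would replace with the acceptable delay you agree on for your own use case.

```python
from enum import Enum


class Approach(Enum):
    SPEECH_ENHANCEMENT = "speech enhancement (RNNoise family)"
    BATCH = "batch + throughput optimization"
    NEAR_REAL_TIME = "near-real-time (small chunk + small model)"
    LOW_LATENCY_MODEL = "causal / low-latency-specialized model (quality compromise)"


def triage(acceptable_delay_s: float, is_live: bool, wants_noise_removal: bool,
           near_real_time_floor_s: float = 1.0) -> Approach:
    """Map a stated requirement to the realistic answer from the table above."""
    if wants_noise_removal:
        return Approach.SPEECH_ENHANCEMENT   # not music separation at all
    if not is_live:
        return Approach.BATCH                # total processing time matters, not per-item delay
    if acceptable_delay_s >= near_real_time_floor_s:
        return Approach.NEAR_REAL_TIME       # some delay is acceptable
    return Approach.LOW_LATENCY_MODEL        # the rare case that needs a dedicated model


print(triage(3.0, is_live=True, wants_noise_removal=False).value)
print(triage(3600.0, is_live=False, wants_noise_removal=False).value)
print(triage(0.05, is_live=True, wants_noise_removal=True).value)
```

The point of writing it down is that the first argument has to be a number: the function cannot be called until someone has stated how many seconds of delay are acceptable.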

---

## FAQ

**Q. Can UVR5 / audio-separator do real-time processing?**
A. These are **batch-processing tools** and aren't designed for real-time use. They're optimized for high-quality separation of one file (or segment) all at once. If low latency is needed, a different approach is required.

**Q. If I make the chunk smaller, will it become real-time?**
A. Delay decreases, but **context shrinks, quality falls, and boundary artifacts increase.** Since standard models are non-causal, chunking is an "approximation," and the inherent lookahead delay and quality drop remain.

**Q. I want to remove only noise in real time.**
A. That's not music separation but the domain of **speech enhancement (noise suppression),** and **real-time-oriented methods are mature**, like [RNNoise](https://github.com/xiph/rnnoise). If the requirement is "voice vs. noise," that's appropriate.

**Q. If I make the GPU stronger, will it become real-time?**
A. **Compute delay decreases, but algorithmic delay (window length, lookahead) doesn't change.** GPU reinforcement alone often doesn't reach true real-time.

**Q. So what should I do in the end?**
A. First **decide the acceptable delay in numbers.** If a few seconds is acceptable, near-real-time; if it doesn't need to be instant, batch. **Questioning "do you truly need zero latency"** is the most effective design decision.

---

## Summary: solve the requirements before jumping at real-time

Making source separation real-time is not a binary of "can/can't" but a problem of **the trade-off between delay and quality.**

1. **Standard models are batch-oriented**: due to non-causality, lookahead, and size, true real-time is hard.
2. **Delay is the sum of three elements**: algorithmic + compute + buffer. The more you raise quality, the more delay increases.
3. **Every way to get close is a trade-off**: chunking, small models, low lookahead — in exchange for quality.
4. **Noise suppression is a different world**: speech enhancement has mature real-time-oriented methods.
5. **The right question is "do you truly need zero latency"**: many are met by near-real-time or batch.

> An engineer you can trust with outsourced work is one who, faced with an "in real time" request, can **unpack the requirements and propose the right delay design.** To balance latency, quality, and cost in audio/video AI, get in touch and see my [track record](/case-studies/ai-video-localization-lipsync). As a **one-person team working with generative AI**, I support everything from requirements definition to production operation, with designs that follow the requirements rather than the trend.

---

## Sources / related resources

- **Characteristics of batch-oriented models**: [MDX-Net](/blog/uvr5-mdx-net-vocal-separation-production-guide) / [Demucs v4](/blog/demucs-v4-music-source-separation-production-guide) / [BS-RoFormer](/blog/bs-roformer-mel-band-roformer-vocal-separation-guide)
- **Real-time noise suppression (reference)**: [xiph/rnnoise](https://github.com/xiph/rnnoise)
- **Production scale in batch**: this blog's [GPU-worker platform](/blog/music-source-separation-production-api-gpu-worker-queue) / [AWS GPU batch](/blog/audio-source-separation-aws-gpu-batch-pipeline)
- **Model selection**: this blog's [how to choose a source-separation tool](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)

* The acceptable latency value and the optimal chunk/lookahead settings depend on the use and hardware. This article's latency decomposition and low-latency moves are based on general audio-processing design principles and don't guarantee the real-time performance of a specific model. Verify with your own requirements and environment before implementation.
