The goal of this article
"I want to remove vocals in real time during a stream." "I want to extract only the voice live." Anyone who works with source separation hears these requests sooner or later, and in most cases the answer is "with standard models, true real-time is hard."
Rather than stopping at "hard," this article explains in design terms how close you can get, where the limits come from, and whether you need real-time at all. By the end, you'll be able to:
- Explain, from the latency breakdown, why MDX-Net / Demucs / RoFormer are batch-oriented.
- Understand how chunk/streaming processing gets closer to low latency, and what it costs in quality.
- Judge whether real-time is actually needed, so the design doesn't collapse under an excessive requirement.
About the author (for context on credibility): I single-handedly designed, implemented, and run in production an AI audio/video platform with source separation as its first stage. In operation I was asked for "real-time" many times, but once the requirements were worked through, near-real-time or batch was sufficient for most of them. This article organizes that experience so you can design to the actual requirements instead of being swept along by the "real-time" trend. For scaling batch processing in production, see the GPU-worker-platform article.
A 30-second summary (the conclusion first)
| Issue | Conclusion |
|---|---|
| Real-time with standard models? | Hard. MDX-Net/Demucs/RoFormer are non-causal (they use future samples) and batch-oriented |
| Why it is hard | Latency = STFT window + lookahead (algorithmic) + model inference (compute) + buffer |
| Ways to get close | Chunk/streaming, small models, low lookahead, overlap-add |
| The price | Boundary artifacts, quality drop, increased latency (always appear) |
| Noise suppression is separate | Speech enhancement (e.g. RNNoise) has mature real-time methods. It is a separate problem from music separation |
| The realistic answer | Most requests are met by near-real-time, or by batch + throughput optimization |
| The right question | Not "can it be real-time?" but "do you truly need zero latency?" |
Why standard models aren't suited to real-time
MDX-Net, Demucs, and BS-RoFormer are all designed to look at "future samples" for quality. This is fundamentally incompatible with low latency.
- Segment + overlap processing: audio is cut into fixed-length chunks that overlap their neighbors so the seams can be smoothed out. Because of the overlap, a chunk cannot be finalized until the samples after it have arrived, which creates lookahead delay.
- Non-causal: these models estimate the present using future context as well as past context. Real-time (causal) processing presupposes not looking at the future, so the two are structurally at odds.
- Large models: RoFormer in particular is heavy, so inferring even a single chunk takes time and VRAM.
In other words, these are batch models optimized to process the whole file (or a long segment) at once, quality first. Real-time is not among their design goals.
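To make "non-causal" concrete, here is a toy sketch. It is not a separation model: a centered moving average stands in for a model that uses future context, and a trailing moving average stands in for a causal one. The function names and the `radius` parameter are invented for this illustration. The only point is that the centered version cannot finalize output `t` until input `t + radius` has arrived.

```python
def centered_average(x, radius):
    """Non-causal: output[t] uses x[t-radius .. t+radius], so it needs future samples."""
    out = []
    for t in range(len(x)):
        lo, hi = max(0, t - radius), min(len(x), t + radius + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def trailing_average(x, radius):
    """Causal: output[t] uses only x[t-radius .. t], available as soon as x[t] arrives."""
    out = []
    for t in range(len(x)):
        lo = max(0, t - radius)
        out.append(sum(x[lo:t + 1]) / (t + 1 - lo))
    return out


def newest_input_needed(t, radius, causal):
    """Index of the newest input sample required before output[t] can be finalized."""
    return t if causal else t + radius


signal = [0.0, 0.0, 1.0, 0.0, 0.0]          # a single impulse at t=2
print(centered_average(signal, 1))          # the impulse already affects output[1]
print(trailing_average(signal, 1))          # the impulse affects output only from t=2 on
print(newest_input_needed(10, 4, causal=False) - 10)  # lookahead in samples
```

In a real model the "radius" is the future context inside a segment plus the overlap with the next one, which is far longer than in this toy.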
The breakdown of latency: what produces delay
To talk about lowering latency, you first need to break the delay into its parts. The delay of real-time audio processing is roughly the sum of three components.

    Total delay = ① Algorithmic delay (STFT window length + lookahead/overlap)
                + ② Compute delay (model inference time for one chunk)
                + ③ Buffering delay (I/O block size)

- ① Algorithmic delay: frequency-domain processing is inherently delayed by the STFT window length, and overlap adds the wait for future samples on top. The more you raise quality (longer window, more overlap), the more the delay grows. This is the core of the trade-off.
- ② Compute delay: the time to run one chunk through the model. Smaller chunks lower it, but context shrinks and quality falls. It also depends on GPU performance.
- ③ Buffering delay: the block size of audio I/O. Smaller blocks lower the delay but increase per-block overhead.
What counts as "real-time" depends on the use case: conversational response, monitoring, and effects each tolerate a different amount of delay. The feasibility criterion is whether ① + ② + ③ stays below that limit.
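The sum above is easy to turn into a back-of-the-envelope check. The sketch below is a minimal helper under my own assumptions: `LatencyBudget` and its field names are hypothetical, and the example numbers are made up for illustration, not measured from any model.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Hypothetical helper: adds up the three delay components in milliseconds."""
    sample_rate: int      # Hz
    stft_window: int      # samples in one STFT window
    lookahead: int        # future samples waited for (overlap / context)
    inference_ms: float   # measured model time for one chunk
    io_block: int         # samples per audio I/O block

    def algorithmic_ms(self) -> float:
        # (1) window length plus lookahead, converted from samples to ms
        return 1000.0 * (self.stft_window + self.lookahead) / self.sample_rate

    def buffering_ms(self) -> float:
        # (3) one I/O block, converted from samples to ms
        return 1000.0 * self.io_block / self.sample_rate

    def total_ms(self) -> float:
        # (1) + (2) + (3)
        return self.algorithmic_ms() + self.inference_ms + self.buffering_ms()

    def fits(self, limit_ms: float) -> bool:
        return self.total_ms() <= limit_ms


# Made-up example values, not measurements of any real model.
budget = LatencyBudget(sample_rate=48000, stft_window=4096, lookahead=24000,
                       inference_ms=80.0, io_block=960)
print(f"algorithmic: {budget.algorithmic_ms():.1f} ms")  # 585.3 ms
print(f"buffering:   {budget.buffering_ms():.1f} ms")    # 20.0 ms
print(f"total:       {budget.total_ms():.1f} ms")        # 685.3 ms
print(budget.fits(200.0), budget.fits(1000.0))           # False True
```

With these numbers the algorithmic term dominates, so in this example a tighter limit would have to come from a shorter window or less lookahead rather than from faster inference.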
The design to get close to low latency (and the price that always appears)
Even though true zero latency is impossible, some designs get you close to near-real-time. They always come at the expense of quality.
Chunk/streaming processing
Instead of processing the whole file, process small chunks in order of arrival and join them with overlap-add.
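
    # streaming_approx.py - skeleton of a streaming approximation (conceptual; quality trade-offs apply)
    from collections import deque


    class StreamingSeparator:
        """Approximation that separates chunks in order of arrival.

        Note: standard source-separation models are non-causal, so this is only an
        approximation. Chunk-boundary artifacts and lookahead delay remain in principle.
        """

        def __init__(self, separate_fn, chunk_samples: int, lookahead: int):
            self._separate = separate_fn  # function that separates one chunk
            self._chunk = chunk_samples
            self._lookahead = lookahead   # wait for a little of the future, for quality
            self._buf: deque = deque()

        def push(self, samples) -> "list | None":
            self._buf.extend(samples)
            need = self._chunk + self._lookahead  # wait until chunk + lookahead are buffered
            if len(self._buf) < need:
                return None  # cannot finalize yet (= the source of latency)
            window = [self._buf.popleft() for _ in range(self._chunk)]
            # The lookahead samples stay buffered; this skeleton only waits for them.
            return self._separate(window)  # smooth the boundaries with overlap-add (done separately)
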
⚠️ This implementation is a concept only. Because standard models are non-causal, chunking still leaves boundary artifacts and lookahead delay in principle. Make `chunk` smaller and delay decreases but quality falls; increase `lookahead` and quality rises but delay increases. There is no universal setting.
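The skeleton leaves the overlap-add step out, so here is a minimal sketch of what joining the separated chunks could look like. The linear crossfade is my own assumption (the article does not prescribe a window shape), and `overlap_add` and its parameters are hypothetical names. It expects each chunk's first `overlap` samples to cover the same stretch of audio as the previous chunk's last `overlap` samples.

```python
def overlap_add(chunks, overlap):
    """Join chunks, crossfading `overlap` samples at each seam with linear ramps."""
    if not chunks:
        return []
    out = list(chunks[0])
    for nxt in chunks[1:]:
        if overlap > len(out) or overlap > len(nxt):
            raise ValueError("overlap is longer than a chunk")
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the new chunk
            out[start + i] = (1.0 - w) * out[start + i] + w * nxt[i]
        out.extend(nxt[overlap:])        # the rest of the new chunk is appended as-is
    return out


# Two chunks that disagree at the seam: the crossfade turns the step into a ramp.
joined = overlap_add([[0.0] * 4, [1.0] * 4], overlap=2)
print(len(joined))   # 6
print(joined)
```

Note what this implies for latency: the tail of a chunk cannot be emitted until the head of the next chunk has been separated, which is exactly the lookahead delay described above.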
Other moves (all trade-offs)
- Use a small, lightweight model (MDX-Net family, etc.): lowers compute delay, but quality falls short of the best models.
- Reduce overlap: lowers lookahead delay, but seam noise increases.
- Use a stronger GPU: lowers only the compute delay (algorithmic delay does not change).
- Adopt a causal / low-latency-specialized model: the commonly distributed checkpoints are mostly non-causal, so you need a separate model built for low latency, and you compromise on quality.
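The GPU point in the list above can be checked with simple arithmetic. The sketch below uses hypothetical function names and made-up numbers, and it ignores buffering delay to keep the example short. It also adds one condition the list implies: a stream only keeps up if one chunk is processed faster than the next one arrives.

```python
def chunk_duration_ms(chunk_samples: int, sample_rate: int) -> float:
    return 1000.0 * chunk_samples / sample_rate


def keeps_up(inference_ms: float, chunk_samples: int, sample_rate: int) -> bool:
    """True if one chunk is processed faster than it takes to record the next one."""
    return inference_ms < chunk_duration_ms(chunk_samples, sample_rate)


def delay_after_speedup(algorithmic_ms: float, inference_ms: float, speedup: float) -> float:
    """Only the compute term is divided by the speedup; the algorithmic term is a floor."""
    return algorithmic_ms + inference_ms / speedup


# Made-up numbers: 500 ms of window + lookahead, 200 ms of inference per chunk.
for speedup in (1, 2, 4, 100):
    print(speedup, delay_after_speedup(500.0, 200.0, speedup))
# The printed delay approaches 500 ms but never drops below it.

# Simplification: inference time is held fixed here; in practice it usually
# shrinks with the chunk, though not necessarily in proportion.
print(keeps_up(inference_ms=200.0, chunk_samples=24000, sample_rate=48000))  # True: 200 ms < 500 ms
print(keeps_up(inference_ms=200.0, chunk_samples=4800, sample_rate=48000))   # False: 200 ms > 100 ms
```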
Noise suppression (speech enhancement) is a different world
These two are easily confused, so let me separate them clearly. "Removing noise in real time" and "separating vocals from music in real time" are different problems.
- Noise suppression / speech enhancement (removing meeting noise, etc.) has mature real-time-oriented methods. For example, RNNoise (Xiph) is widely used for lightweight, low-latency, real-time noise suppression. The task is "voice vs. noise," which lends itself to causal, lightweight designs.
- Music source separation (demixing) separates musically overlapping components (voice, drums, bass, accompaniment), and quality depends on rich context, including future samples. That is why standard models are batch-oriented, and why it is accurate to say that true real-time music separation is unsolved with standard models.
If "I want to remove only the vocals in real time" really means "I want to remove noise," look at the speech-enhancement domain instead. Restating the requirement leads to the right technology choice.
The most important question: do you truly need real-time?
This is where the designer's skill shows. Once the requirements are worked through, most "in real time" requests are met by a different solution.
| The apparent request | The true requirement | The realistic answer |
|---|---|---|
| "Remove voice in a live stream" | A few seconds of delay is often acceptable | Near-real-time (small chunk + small model) |
| "Process uploaded video" | Doesn't need to be instant | Batch + throughput optimization |
| "Remove meeting noise" | Not music separation but noise suppression | Speech enhancement (RNNoise family) |
| "Process thousands of files fast" | Not one item's delay but total processing time | Parallel batch (GPU-worker platform) |
💡 A design principle: don't jump at "can it be real-time?" First ask "how many seconds of delay are acceptable?" and "is there really a requirement that breaks without zero latency?" Most cases are met by near-real-time or batch, which is also more stable in quality and cost. An excessive low-latency requirement sacrifices both.
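The triage in the table can be written down as a small function, which is a handy way to force the "how many seconds?" question early. Everything here is an assumption for illustration: the function name, the task labels, and especially the `near_real_time_floor_s` threshold, which is a placeholder and not a recommended value.

```python
from typing import Optional


def suggest_approach(task: str, acceptable_delay_s: Optional[float],
                     near_real_time_floor_s: float = 1.0) -> str:
    """Hypothetical triage helper mirroring the table above.

    task: "music_separation" or "noise_suppression" (labels invented for this sketch).
    acceptable_delay_s: None means there is no per-item latency requirement.
    near_real_time_floor_s: placeholder threshold; choose your own from measurements.
    """
    if task == "noise_suppression":
        return "speech enhancement (RNNoise family)"
    if acceptable_delay_s is None:
        return "batch + throughput optimization"
    if acceptable_delay_s >= near_real_time_floor_s:
        return "near-real-time (small chunk + small model)"
    return "causal / low-latency-specialized model (expect a quality compromise)"


print(suggest_approach("music_separation", 3.0))    # live stream, a few seconds is fine
print(suggest_approach("music_separation", None))   # uploaded video
print(suggest_approach("noise_suppression", 0.02))  # meeting noise
print(suggest_approach("music_separation", 0.05))   # the genuinely hard case
```

The last branch is the only one where the limits discussed in this article really bite, which is why it is worth confirming that a request truly lands there.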
FAQ
Q. Can UVR5 / audio-separator do real-time processing? A. These are batch-processing tools and are not designed for real-time use. They're optimized for high-quality separation of one file (or segment) all at once. If low latency is needed, a different approach is required.
Q. If I make the chunk smaller, will it become real-time? A. Delay decreases, but context shrinks, quality falls, and boundary artifacts increase. Since standard models are non-causal, chunking is an approximation, and the inherent lookahead delay and quality drop remain.
Q. I want to remove only noise in real time. A. That's not music separation but speech enhancement (noise suppression), where real-time methods such as RNNoise are mature. If the requirement is "voice vs. noise," that is the appropriate choice.
Q. If I make the GPU stronger, will it become real-time? A. Compute delay decreases, but algorithmic delay (window length, lookahead) doesn't change. A stronger GPU alone often does not get you to true real-time.
Q. So what should I do in the end? A. First put a number on the acceptable delay. If a few seconds is acceptable, go near-real-time; if it doesn't need to be instant, go batch. Asking "do you truly need zero latency?" is the most effective design decision.
Summary: solve the requirements before jumping at real-time
Real-time source separation is not a "can / can't" binary but a trade-off between delay and quality.
- Standard models are batch-oriented: due to non-causality, lookahead, and size, true real-time is hard.
- Delay is the sum of three elements: algorithmic + compute + buffer. The more you raise quality, the more delay increases.
- Every way to get close is a trade-off: chunking, small models, and low lookahead all cost quality.
- Noise suppression is a different world: speech enhancement has mature real-time methods.
- The right question is "do you truly need zero latency?": most requests are met by near-real-time or batch.
The engineer clients trust with contract work is the one who, faced with an "in real time" request, can work through the requirements and propose the right delay design. If you need to balance latency, quality, and cost in audio/video AI, feel free to consult me and review my track record. Working solo with generative AI, I cover everything from requirements definition to production operation, with designs that follow the requirements rather than the trend.
Sources / related resources
- Characteristics of batch-oriented models: MDX-Net / Demucs v4 / BS-RoFormer
- Real-time noise suppression (reference): xiph/rnnoise
- Production scale in batch: this blog's GPU-worker platform / AWS GPU batch
- Model selection: this blog's how to choose a source-separation tool
- The acceptable latency and the optimal chunk/lookahead settings depend on the use case and hardware. This article's latency decomposition and low-latency moves are based on general audio-processing design principles and don't guarantee the real-time performance of any specific model. Verify against your own requirements and environment before implementation.