Qwen-TTS real-time voice-agent implementation guide: WebSocket streaming, browser playback, and barge-in (interruption)

A voice agent's quality depends less on how smart it is than on the pause (ma) between turns. However good the answer is, if the agent stays silent for 2 seconds between the user's question and the first audio, it feels "broken." Conversely, if the first word comes back within 0.5 seconds and the agent falls silent the instant the user starts talking, the experience suddenly becomes a "dialogue."

This article is a guide to implementing a production-grade, low-latency voice agent with Alibaba Cloud's Qwen3-TTS-Flash-Realtime. The basics (models, voices, pricing) are covered in the Qwen-TTS production-operations guide; this article concentrates on designing streaming that replies while the answer is still being generated. The context is the RAG customer-service system I actually built as a generative-AI voice chatbot.

Ground rules for this article: the API specifications follow the Alibaba Cloud Model Studio documentation for Qwen-TTS real-time speech synthesis (as of June 2026). Method and event names follow the official SDK, but the SDK changes over time, so confirm the latest types in the official documentation before going to production. The API key is assumed to come from an environment variable (never hardcode it, and never expose it to the browser).

0. The big picture: the voice-agent loop and the latency budget

A real-time voice agent is a chain of 3 stages.

user utterance → ① STT (transcription) → ② LLM (response generation) → ③ TTS (speech synthesis) → playback
       ↑__________________________ barge-in (stop on interruption)__________________________|

① STT: see the Whisper / gpt-4o-transcribe streaming guide.
② LLM: for response streaming, see the Vercel AI SDK production guide.
③ TTS: the subject of this article, handled by Qwen-TTS realtime.

What determines perceived responsiveness is not total latency but the time until the first audio. That gives two iron rules of design.

Pipeline the stages: don't wait for the LLM to finish generating; pass text to TTS the moment the first sentence is complete.
Make first_audio_delay a measurement target: treat the time to first audio as a production SLI (service-level indicator) and monitor it for regressions (chapter 6).
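
To make the first rule concrete, here is a minimal arithmetic sketch of the latency budget. The stage names and every number in it are made up for illustration (they are not measurements of any service); the point is only that pipelining replaces "wait for the full reply" with "wait for the first sentence."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Illustrative per-stage timings in milliseconds (made-up numbers)."""
    stt_final_ms: int           # end of user speech -> final transcript
    llm_first_sentence_ms: int  # prompt sent -> first complete sentence
    llm_full_reply_ms: int      # prompt sent -> last token
    tts_first_audio_ms: int     # text handed to TTS -> first audio chunk


def time_to_first_audio(lat: StageLatency, pipelined: bool) -> int:
    """Time from the end of user speech to the first audible sample."""
    llm_wait = lat.llm_first_sentence_ms if pipelined else lat.llm_full_reply_ms
    return lat.stt_final_ms + llm_wait + lat.tts_first_audio_ms


example = StageLatency(stt_final_ms=200, llm_first_sentence_ms=400,
                       llm_full_reply_ms=2500, tts_first_audio_ms=300)
print(time_to_first_audio(example, pipelined=False))  # 3000
print(time_to_first_audio(example, pipelined=True))   # 900
```

The total time to finish speaking is the same in both cases; only the wait before the first sound changes, which is the quantity the SLI tracks.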

"The first is fast" beats "the total is fast." Batch narration generation and dialogue (real-time) are different things. If batch suffices, not opening a WebSocket is simpler and cheaper (KISS). This article is about dialogue.

1. The basics of the realtime API: WebSocket and events

Qwen-TTS realtime runs on a full-duplex WebSocket.

Models: qwen3-tts-flash-realtime / qwen3-tts-instruct-flash-realtime
Endpoint (international): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime (mainland China is dashscope.aliyuncs.com)
Audio arrives incrementally via the response.audio.delta event as base64-encoded PCM (24kHz/mono/16bit)
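
As a rough illustration of what one audio event carries, the sketch below decodes a delta payload and computes how long it plays. It assumes only the facts stated above (base64 PCM, 24kHz/mono/16bit) and the "type" / "delta" field names used in this article's callback code; the helper names and the fake event are hypothetical.

```python
import base64

SAMPLE_RATE_HZ = 24_000   # from the format stated above
BYTES_PER_SAMPLE = 2      # 16-bit
CHANNELS = 1              # mono


def delta_to_pcm(event: dict) -> bytes:
    """Decode the base64 payload of a response.audio.delta event into raw PCM."""
    if event.get("type") != "response.audio.delta":
        raise ValueError(f"not an audio delta: {event.get('type')!r}")
    return base64.b64decode(event["delta"])


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback time of a PCM chunk at 24 kHz / mono / 16-bit."""
    return len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# 4800 bytes = 2400 samples = 0.1 s of audio (silence, built by hand)
fake_event = {"type": "response.audio.delta",
              "delta": base64.b64encode(b"\x00" * 4800).decode("ascii")}
print(pcm_duration_seconds(delta_to_pcm(fake_event)))  # 0.1
```

Knowing the duration of each chunk is what later makes gapless scheduling and backpressure accounting possible.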

The official SDK delivers server events through callbacks. A minimal implementation looks like this.

import os
import base64
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import *

class TtsCallback(QwenTtsRealtimeCallback):
    """Handles events sent by the server. Audio fragments arrive here."""
    def __init__(self, sink) -> None:
        self._done = threading.Event()
        self._sink = sink  # destination for received PCM (file / playback / WS relay)

    def on_open(self) -> None:
        # Connection established. Initialize the player etc. here.
        pass

    def on_close(self, code, msg) -> None:
        self._sink.close()
        self._done.set()  # also release wait() if the socket closes before session.finished

    def on_event(self, response: dict) -> None:
        etype = response["type"]
        if etype == "session.created":
            pass  # the session ID is available as response["session"]["id"]
        elif etype == "response.audio.delta":
            pcm = base64.b64decode(response["delta"])  # 24kHz mono 16bit
            self._sink.write(pcm)                       # pass downstream immediately (no buffering)
        elif etype == "response.done":
            pass  # synthesis of one response is complete
        elif etype == "session.finished":
            self._done.set()                            # session ended

    def wait(self) -> None:
        self._done.wait()

Connection and session setup:

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
dashscope.base_websocket_api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

callback = TtsCallback(sink)   # keep your own reference (wait() is called on it later)
tts = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
tts.connect()
tts.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode="server_commit",   # compared with commit in chapter 2
)

How text is sent depends on the mode (next chapter). The basic pattern is to feed text with append_text() and then finalize with finish() (or commit()).

tts.append_text("Thank you for contacting us today.")
tts.finish()       # server_commit: signal that no more text is coming
callback.wait()    # wait until session.finished (using the reference kept above)
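
The `sink` passed to the callback is left undefined above. As one possible shape, here is a minimal sketch of a file sink built on Python's standard `wave` module. The class name `WavFileSink` is hypothetical; the only contract assumed is the `write(pcm)` / `close()` pair that the callback calls.

```python
import wave


class WavFileSink:
    """Hypothetical sink: collects raw PCM chunks into a playable .wav file."""

    def __init__(self, path: str, sample_rate: int = 24_000) -> None:
        self._wav = wave.open(path, "wb")
        self._wav.setnchannels(1)      # mono
        self._wav.setsampwidth(2)      # 16-bit samples
        self._wav.setframerate(sample_rate)
        self._closed = False

    def write(self, pcm: bytes) -> None:
        # Ignore late chunks that arrive after the file was closed.
        if not self._closed:
            self._wav.writeframes(pcm)

    def close(self) -> None:
        # Safe to call more than once (e.g. from on_close and from cleanup code).
        if not self._closed:
            self._closed = True
            self._wav.close()
```

With this in place, `sink = WavFileSink("reply.wav")` followed by `TtsCallback(sink)` would save the synthesized reply to disk. A playback or relay sink can expose the same two methods.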

2. `server_commit` and `commit`: who decides the segmentation

The realtime API has two modes that differ in who segments the text; choose one by use case.

	`server_commit` (server-driven)	`commit` (client-driven)
Segmentation	the server decides where to split the text	the client marks each boundary explicitly with `commit()`
How to send	repeat `append_text()`, then `finish()`	`append_text()` then `commit()` for each utterance
Best suited to	continuous synthesis of articles / long narration	dialogue (segmenting per utterance)

When feeding LLM output into a voice agent, there are two options.

A. Send per sentence with commit: while receiving the LLM's tokens, split at sentence-ending punctuation or newlines and call commit(). This turns the first sentence into audio the fastest (minimal first_audio_delay). The extra control costs extra code.
B. Hand everything to server_commit: keep calling append_text() with the tokens as they arrive, then call finish() at the end. The implementation is simple, but segmentation is left to the server.

When in doubt, use A (commit) for dialogue and B (server_commit) for reading articles aloud. In dialogue, where the fastest possible first audio is the top priority, cutting the sentence boundaries yourself is worth it (ETC, "easier to change": the segmentation logic is easy to swap later).
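
The two options differ only in the order of calls, which the sketch below records with a test double instead of a real connection. `TtsLike` and `RecordingTts` are hypothetical names introduced for illustration; only append_text(), commit(), and finish() come from this article.

```python
from typing import Iterable, List, Protocol


class TtsLike(Protocol):
    """The three SDK calls named in this chapter (structural type for illustration)."""
    def append_text(self, text: str) -> None: ...
    def commit(self) -> None: ...
    def finish(self) -> None: ...


def send_commit_mode(tts: TtsLike, sentences: Iterable[str]) -> None:
    """Option A: the client marks each utterance boundary itself."""
    for sentence in sentences:
        tts.append_text(sentence)
        tts.commit()
    tts.finish()


def send_server_commit_mode(tts: TtsLike, tokens: Iterable[str]) -> None:
    """Option B: stream tokens as they come; the server decides where to split."""
    for token in tokens:
        tts.append_text(token)
    tts.finish()


class RecordingTts:
    """Test double that records the call order instead of opening a WebSocket."""
    def __init__(self) -> None:
        self.calls: List[str] = []

    def append_text(self, text: str) -> None:
        self.calls.append(f"append:{text}")

    def commit(self) -> None:
        self.calls.append("commit")

    def finish(self) -> None:
        self.calls.append("finish")


a, b = RecordingTts(), RecordingTts()
send_commit_mode(a, ["Hello.", "How can I help?"])
send_server_commit_mode(b, ["Hel", "lo. ", "How can I help?"])
print(a.calls)
print(b.calls)
```

Keeping the sender behind a small interface like this is one way to get the "easy to swap later" property: the segmentation policy can change without touching the code that talks to the SDK.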

3. LLM → TTS pipeline: make the first sentence into audio fastest

The crux is: don't wait for the LLM's full text. As tokens stream in, pass them to TTS at each sentence boundary. Factor sentence splitting out into a single-responsibility pure function (SRP, testability).

import re

_SENTENCE_END = re.compile(r"(?<=[。．！？!?\n])")

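def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Pure function returning the completed sentences and the unfinished remainder.
    Text up to a sentence-ending mark or newline counts as complete;
    a trailing fragment is carried over to the next call."""
    parts = _SENTENCE_END.split(buffer)
    if not buffer.endswith(("。", "．", "！", "？", "!", "?", "\n")):
        remainder = parts.pop() if parts else ""
    else:
        remainder = ""
    return [p for p in parts if p.strip()], remainder

def speak_llm_stream(tts: "QwenTtsRealtime", token_stream) -> None:
    """Feeds the LLM token stream to TTS sentence by sentence
    (commit mode; the session must be configured with mode="commit")."""
    buf = ""
    for token in token_stream:        # ② streaming output from the LLM
        buf += token
        sentences, buf = split_complete_sentences(buf)
        for s in sentences:
            tts.append_text(s)
            tts.commit()              # finalize per sentence → the first audio comes out early
    if buf.strip():                   # trailing remainder
        tts.append_text(buf)
        tts.commit()                  # assumption: in commit mode, uncommitted text is not synthesized
    tts.finish()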

Beware of backpressure: the LLM generates text faster than audio plays back (audio advances in real time), so calling append_text() without limit swells the internal queue. For long text, track the seconds of audio already played and throttle sending, or leave pacing to the server with server_commit. Audio is bound by the physics of real time, which is the decisive difference from a text stream.
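
One way to track "seconds played" is sketched below. `PlaybackBudget` is a hypothetical helper, not part of the SDK, and it rests on a simplifying assumption: playback starts with the first received chunk and then runs in real time without pauses. Under that assumption, the backlog is the audio received minus the time elapsed.

```python
import time
from typing import Callable, Optional


class PlaybackBudget:
    """Estimates how much synthesized audio is still waiting to be played."""

    def __init__(self, max_buffered_s: float = 5.0,
                 bytes_per_second: int = 24_000 * 2,   # 24 kHz, 16-bit, mono
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_buffered_s
        self._bps = bytes_per_second
        self._clock = clock
        self._received_s = 0.0
        self._started_at: Optional[float] = None

    def on_audio(self, pcm: bytes) -> None:
        """Report every received PCM chunk here."""
        if self._started_at is None:
            self._started_at = self._clock()   # assume playback starts now
        self._received_s += len(pcm) / self._bps

    def buffered_seconds(self) -> float:
        if self._started_at is None:
            return 0.0
        played = self._clock() - self._started_at
        return max(0.0, self._received_s - played)

    def may_send_more(self) -> bool:
        """True while the unplayed backlog is under the limit."""
        return self.buffered_seconds() < self._max
```

A sender loop could check `may_send_more()` before each append_text() and sleep briefly when it returns False. A real player would report its actual playback position instead of relying on the wall clock.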

4. Browser playback: play PCM without exposing the key

Don't call DashScope directly from the browser (it leaks the API key). The correct architecture has three layers.

Browser  ⇄  your server (Next.js)  ⇄  DashScope WS
(audio playback)   (holds key, WS relay, authorization)   (synthesis)

The server (Next.js / Node) opens a WebSocket with the browser, connects internally to DashScope realtime, and relays the audio fragments. The browser plays the arriving PCM without gaps using the Web Audio API. A player that plays 24kHz/mono/16bit PCM as-is can be written like this.

/** A player that plays 24kHz mono 16bit PCM chunks in order, without gaps.
 *  Schedules each AudioBufferSourceNode at an explicit time to prevent dropouts. */
export class PcmStreamPlayer {
  private ctx: AudioContext;
  private nextStartTime = 0;
  private readonly sources = new Set<AudioBufferSourceNode>();

  constructor(private readonly sampleRate = 24_000) {
    this.ctx = new AudioContext({ sampleRate });
  }

  /** Receives one chunk of Int16 PCM (little-endian) and schedules it at the end of the queue. */
  enqueue(pcm: ArrayBuffer): void {
    const int16 = new Int16Array(pcm);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000; // normalize to -1..1

    const buffer = this.ctx.createBuffer(1, float32.length, this.sampleRate);
    buffer.getChannelData(0).set(float32);

    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);

    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    src.start(startAt);                       // schedule right after the previous chunk
    this.nextStartTime = startAt + buffer.duration;

    this.sources.add(src);
    src.onended = () => this.sources.delete(src);
  }

  /** Barge-in (interruption): stop the playing audio immediately and discard the queue. */
  stop(): void {
    for (const s of this.sources) { try { s.stop(); } catch {} }
    this.sources.clear();
    this.nextStartTime = this.ctx.currentTime;
  }
}

The browser-side receive loop (play the PCM the server relayed):

const player = new PcmStreamPlayer();
const ws = new WebSocket(`wss://${location.host}/api/voice`); // your own server's WS
ws.binaryType = "arraybuffer";

ws.onmessage = (e) => {
  if (e.data instanceof ArrayBuffer) player.enqueue(e.data); // audio PCM
  // design control messages (text) to arrive separately as JSON
};
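
On the other end of that socket, the relay has to move PCM from the TTS callback to the browser connection. If the relay process uses the Python SDK from chapter 1 together with an asyncio-based WebSocket server, one possible bridge is sketched below. It assumes the SDK invokes callbacks on its own thread; `QueueSink`, `relay`, and the `send_bytes` coroutine (standing in for a WebSocket's binary send) are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class QueueSink:
    """Hypothetical sink that hands PCM from a callback thread to an asyncio relay.

    Construct it inside the running event loop.
    """

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self.queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()

    def write(self, pcm: bytes) -> None:
        # Called from another thread: asyncio.Queue is not thread-safe,
        # so schedule the put on the event loop's own thread.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, pcm)

    def close(self) -> None:
        # None is the end-of-stream marker.
        self._loop.call_soon_threadsafe(self.queue.put_nowait, None)


async def relay(sink: QueueSink,
                send_bytes: Callable[[bytes], Awaitable[None]]) -> int:
    """Forward chunks to the browser until the end marker; returns bytes relayed."""
    total = 0
    while (chunk := await sink.queue.get()) is not None:
        await send_bytes(chunk)
        total += len(chunk)
    return total
```

Each chunk is forwarded as soon as it arrives, which matches the "pass downstream immediately" rule in the callback. A Node relay would follow the same shape with its own WebSocket client.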

Start the AudioContext on a user gesture: many browsers block autoplay. Call new AudioContext() / ctx.resume() in the onClick handler of a "Start conversation" button so that audio is enabled by a user action (both accessibility and browser policy require this).

Always authorize the relay WS (/api/voice): this endpoint is a billing path that holds the DashScope API key. Before relaying anything, verify a session cookie or a short-lived token at the handshake stage. If the endpoint is unauthenticated, a third party can use TTS on your bill. Keeping the key off the browser only helps if the relay itself is protected by authorization.
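
A short-lived token can be as simple as an HMAC-signed expiry. The following is a minimal sketch using only the Python standard library; the token format (`user_id.expiry.signature`) and the function names are assumptions made for illustration, and a production system might prefer an established format such as JWT.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def issue_token(secret: bytes, user_id: str, ttl_s: int = 60,
                now: Optional[float] = None) -> str:
    """Return 'user_id.expiry.signature' (hypothetical format)."""
    expires = int((time.time() if now is None else now) + ttl_s)
    payload = f"{user_id}.{expires}"
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    return f"{payload}.{base64.urlsafe_b64encode(sig).decode()}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        user_id, expires_str, sig_b64 = token.rsplit(".", 2)
        expires = int(expires_str)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:
        return None  # malformed token
    expected = hmac.new(secret, f"{user_id}.{expires_str}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    if (time.time() if now is None else now) >= expires:
        return None
    return user_id
```

The page that renders the "Start conversation" button would issue the token for the logged-in user, and the WebSocket handshake would reject the connection when `verify_token` returns None.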

5. Barge-in (interruption): go silent instantly when talked to

The feature that most makes a dialogue feel like a "dialogue" is barge-in. When the user starts talking while the agent is speaking, ① stop generation and ② instantly discard the audio being played.

When VAD (voice activity detection) detects that the user is speaking:
discard the in-progress TTS session on the server side (stop relaying subsequent response.audio.delta events), and
flush the playback buffer with the browser's player.stop().

// Browser: stop immediately on the signal that the user started speaking
function onUserStartedSpeaking() {
  player.stop();                         // discard the playing audio (leave no tail)
  ws.send(JSON.stringify({ type: "barge_in" })); // ask the server to interrupt generation
}

The crux is to give the playback side the responsibility for stopping. Even if the server's interruption lags by a moment, the user perceives an immediate stop as long as the browser goes silent first. When latency matters, stopping at the layer closest to the user is the right answer.
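
For the server half ("don't relay subsequent response.audio.delta"), one simple approach is to tag each agent reply with an ID and drop chunks that belong to a cancelled or superseded reply. The sketch below shows that idea; `ResponseGate` is a hypothetical helper, and how the reply ID is attached to incoming chunks is left to the relay.

```python
import threading


class ResponseGate:
    """Hypothetical server-side gate: only audio of the current reply is relayed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()   # callbacks and WS handlers may run on different threads
        self._current = 0               # id of the reply allowed to reach the browser
        self._cancelled = False

    def start_response(self) -> int:
        """Call when a new agent reply begins; returns its id."""
        with self._lock:
            self._current += 1
            self._cancelled = False
            return self._current

    def barge_in(self) -> None:
        """Call when the browser sends {"type": "barge_in"}."""
        with self._lock:
            self._cancelled = True

    def should_relay(self, response_id: int) -> bool:
        """Check before forwarding each audio chunk."""
        with self._lock:
            return response_id == self._current and not self._cancelled
```

Chunks that are still in flight from the interrupted reply are silently dropped, so they cannot leak into the next reply even if they arrive after the new one has started.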

6. Resilience, idempotency, observability

WebSockets disconnect, and realtime usage is billed. For production, put the following in place.

Measure first_audio_delay: obtain the time to first audio with the SDK's get_first_audio_delay() and emit it to structured logs and metrics. The official documentation explicitly notes that the first request is slow because it includes establishing the WebSocket connection, so pooling connections and keeping them warm is standard practice.
Automatic reconnection (exponential backoff): reconnect when on_close fires. But never play the same utterance twice: assign each utterance a unique ID and skip IDs that were already played (idempotency).
Timeout and fallback: if first_audio_delay exceeds a threshold (e.g., 1.5 seconds), switch to a canned "please wait a moment" clip or fall back to batch TTS (resilience).
Make cost visible: billing scales with character count. For each session, record synthesized characters, estimated cost, first_audio_delay, and interruption count, and correlate them with OpenTelemetry.
Don't retain PII: keep conversation text (which may contain personal information) out of logs; log metadata only.

# At session end, emit the measurements to structured logs (never the conversation text)
log.info("tts_session", extra={
    "session_id": tts.get_session_id(),
    "first_audio_delay_ms": tts.get_first_audio_delay(),
    "chars": total_chars,
    "barge_in_count": barge_ins,
})
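
The reconnection and idempotency items can be sketched in a few lines. The version below uses the common "full jitter" variant of exponential backoff (a swapped-in choice; the article only says "exponential backoff"), and both `backoff_delays` and `PlayedUtterances` are hypothetical names.

```python
import random
from typing import Iterator, Optional, Set


def backoff_delays(base_s: float = 0.5, cap_s: float = 30.0, attempts: int = 6,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield full-jitter delays: uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    for n in range(attempts):
        yield rng.uniform(0.0, min(cap_s, base_s * (2 ** n)))


class PlayedUtterances:
    """Remembers utterance ids that were already played."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_time(self, utterance_id: str) -> bool:
        """True exactly once per id; a replay after a reconnect returns False."""
        if utterance_id in self._seen:
            return False
        self._seen.add(utterance_id)
        return True
```

A reconnect loop would sleep for each yielded delay between attempts and give up (or switch to the fallback audio) when the iterator is exhausted. After reconnecting, the relay would resend only utterances for which `first_time()` returns True.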

7. Conclusion: a design checklist for a low-latency voice agent

The metric is first_audio_delay: optimize the time to first audio, not the total, and make it an SLI.
Pipeline: as soon as the LLM's first sentence is complete, send it to TTS (per sentence, with commit).
3-layer configuration: browser ⇄ your server ⇄ DashScope. Key on the server, playback in the browser.
Gapless playback: time-schedule PCM 24kHz with AudioBufferSourceNode.
Barge-in: instant stop at the playback layer closest to the user + generation interruption.
Resilience: reconnection + utterance-ID idempotency, canned-audio fallback on timeout.
Observability: correlate first_audio_delay, character count, cost, and interruption count in logs. Don't output PII text.

A voice agent is about designing "the pause" more than "smartness." In the RAG-equipped voice customer-service system, I built in fast responses and natural interruption. If you want to turn your business into a no-waiting voice dialogue, I can work alongside you end to end, from design through implementation and operation. Feel free to get in touch, starting from requirements gathering.

References (official documentation)

Qwen-TTS real-time speech synthesis (Alibaba Cloud Model Studio) — WebSocket protocol, SDK, events, modes
Qwen-TTS speech synthesis (Model Studio) — the basics of model, voice, parameters
MDN: Web Audio API — scheduled playback with AudioBufferSourceNode


