# Qwen-TTS real-time voice-agent implementation guide: WebSocket streaming, browser playback, and barge-in (interruption)

> A guide to building a production-grade, low-latency voice agent that starts speaking while the reply is still being generated, using Qwen3-TTS-Flash-Realtime. With working code, it covers the bidirectional WebSocket protocol (session.created / response.audio.delta / session.finished), when to use server_commit vs. commit, streaming synthesis of LLM output, gapless 24 kHz PCM playback in the browser, barge-in (interruption), connection resilience, and measuring first_audio_delay.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Python, Speech synthesis, Generative AI, Qwen, Real-time, Architecture design
- URL: https://tomodahinata.com/en/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide
- Category: Voice AI
- Pillar guide: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents

## Key points

- A voice agent's perceived quality is determined by the time until the first audio (first_audio_delay), so the design centers on that metric.
- Qwen-TTS realtime works over a WebSocket: send text with append_text, then commit or finish; audio arrives incrementally as response.audio.delta events (base64-encoded PCM).
- Depending on requirements, either commit the LLM's output at sentence boundaries yourself or leave segmentation to the server with server_commit.
- Never expose the API key to the browser: a server such as Next.js relays the WebSocket, and the browser plays the PCM gaplessly.
- Production also needs barge-in (instant stop on interruption), automatic reconnection, timeouts, and observability of cost and latency.

---

A voice agent's quality is decided less by how smart it is than by **its pauses (*ma*).** However good the answer, if the agent stays silent for 2 seconds after the user asks, it feels "broken." Conversely, when **the first word comes back within 0.5 seconds and the agent falls silent the instant the user starts talking**, the experience suddenly becomes a "dialogue."

This article is a guide to building a production **low-latency voice agent** with **Qwen3-TTS-Flash-Realtime** from Alibaba Cloud (Qwen). The basics (models, voices, pricing) are covered in the [Qwen-TTS production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide); this article concentrates on **designing streaming that starts speaking while the reply is still being generated.** The context is the RAG customer-service system I built for [the generative-AI voice chatbot](/case-studies/ai-voice-chatbot).

> **Rules for this article**: the API details are based on the **Alibaba Cloud Model Studio Qwen-TTS real-time speech-synthesis documentation (as of June 2026).** Method and event names follow the official SDK, but the SDK changes over time, so confirm the latest types in the [official documentation](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-realtime) before going to production. The API key is assumed to come from an environment variable (never hardcode it, and never expose it to the browser).

---

## 0. The big picture: the voice-agent loop and the latency budget

A real-time voice agent is a chain of 3 stages.

```text
user utterance → ① STT (transcription) → ② LLM (response generation) → ③ TTS (speech synthesis) → playback
       ↑__________________________ barge-in (stop on interruption)__________________________|
```

- ① STT: handled by [Whisper / gpt-4o-transcribe streaming](/blog/openai-whisper-production-guide-selfhost-vs-api).
- ② LLM response streaming: covered in the [Vercel AI SDK production guide](/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag).
- ③ **TTS is the subject of this article**, handled by Qwen-TTS realtime.

What determines perceived quality is not total latency but **the time until the first audio.** That gives two iron rules of design.

1. **Pipeline the stages**: don't wait for the LLM to finish generating; **send text to TTS the moment the first sentence is complete.**
2. **Make first_audio_delay a measured target**: treat time to first audio as a production SLI (metric) and monitor it for regressions (section 6).
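
To make the effect of pipelining concrete, here is a small arithmetic sketch. The stage timings are hypothetical round numbers chosen for illustration, not measurements, and `StageLatency` / `time_to_first_audio_ms` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Illustrative per-stage timings in milliseconds (hypothetical values)."""
    stt_final_ms: int           # end of user speech -> final transcript
    llm_first_sentence_ms: int  # LLM request -> first complete sentence
    llm_full_response_ms: int   # LLM request -> last token
    tts_first_audio_ms: int     # text sent to TTS -> first audio chunk


def time_to_first_audio_ms(lat: StageLatency, pipelined: bool) -> int:
    """Time from the end of user speech to the first audible output."""
    # Pipelined: TTS starts after the first sentence. Sequential: after the full reply.
    llm_wait = lat.llm_first_sentence_ms if pipelined else lat.llm_full_response_ms
    return lat.stt_final_ms + llm_wait + lat.tts_first_audio_ms


budget = StageLatency(
    stt_final_ms=200,
    llm_first_sentence_ms=400,
    llm_full_response_ms=2500,
    tts_first_audio_ms=300,
)
print(time_to_first_audio_ms(budget, pipelined=False))  # 3000
print(time_to_first_audio_ms(budget, pipelined=True))   # 900
```

With these assumed numbers, total generation time is unchanged, but the user hears the first sound after 0.9 seconds instead of 3 seconds.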

> "A fast first sound" beats "a fast total." Batch narration generation and real-time dialogue are different problems. If batch is enough, skipping the WebSocket is simpler and cheaper (KISS). This article is about **dialogue.**

---

## 1. The basics of the realtime API: WebSocket and events

Qwen-TTS realtime runs on a **full-duplex WebSocket.**

- Models: `qwen3-tts-flash-realtime` / `qwen3-tts-instruct-flash-realtime`
- Endpoint (international): `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime` (mainland China is `dashscope.aliyuncs.com`)
- Audio arrives incrementally via the **`response.audio.delta` event** as base64-encoded PCM (24kHz/mono/16bit)

The official SDK delivers server events through **callbacks.** A minimal implementation looks like this.

```python
import os
import base64
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import (
    AudioFormat,
    QwenTtsRealtime,
    QwenTtsRealtimeCallback,
)

class TtsCallback(QwenTtsRealtimeCallback):
    """Handles events sent by the server. Audio fragments arrive here."""
    def __init__(self, sink) -> None:
        self._done = threading.Event()
        self._sink = sink  # destination for received PCM (file / playback / WS relay)

    def on_open(self) -> None:
        # Connection established. Initialize a player here if needed.
        pass

    def on_close(self, code, msg) -> None:
        self._sink.close()
        self._done.set()  # also release wait() if the connection drops mid-session

    def on_event(self, response: dict) -> None:
        etype = response["type"]
        if etype == "session.created":
            pass  # the session ID is available as response["session"]["id"]
        elif etype == "response.audio.delta":
            pcm = base64.b64decode(response["delta"])  # 24kHz mono 16bit
            self._sink.write(pcm)                       # forward downstream at once (no buffering)
        elif etype == "response.done":
            pass  # synthesis of one response is complete
        elif etype == "session.finished":
            self._done.set()                            # session ended

    def wait(self) -> None:
        self._done.wait()
```

Connection and session setup:

```python
dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
dashscope.base_websocket_api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

# `sink` is any object that provides write(bytes) and close()
callback = TtsCallback(sink)   # keep your own reference (wait() is called on it later)
tts = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
tts.connect()
tts.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode="server_commit",   # compared with commit in section 2
)
```

How you send text **depends on the mode** (next section). In general, stream it in with `append_text()` and finalize with `finish()` (or `commit()`).

```python
tts.append_text("本日はお問い合わせありがとうございます。")  # "Thank you for contacting us today."
tts.finish()       # server_commit: signal that no more text will follow
callback.wait()    # block until session.finished (uses the reference kept above)
```
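
The `sink` passed to `TtsCallback` is left undefined in the snippets above. As a minimal sketch, assuming all it needs is `write(bytes)` and `close()`, the following hypothetical `WavFileSink` collects the PCM into a WAV file using only the standard library. It is handy for checking the audio offline before wiring up a relay.

```python
import wave


class WavFileSink:
    """Collects raw 24 kHz / mono / 16-bit PCM chunks into a WAV file."""

    def __init__(self, path: str, sample_rate: int = 24_000) -> None:
        self._wav = wave.open(path, "wb")
        self._wav.setnchannels(1)        # mono
        self._wav.setsampwidth(2)        # 16-bit samples = 2 bytes
        self._wav.setframerate(sample_rate)
        self._closed = False

    def write(self, pcm: bytes) -> None:
        # Ignore late chunks that arrive after close().
        if not self._closed:
            self._wav.writeframes(pcm)

    def close(self) -> None:
        # Idempotent: safe to call more than once.
        if not self._closed:
            self._closed = True
            self._wav.close()
```

Usage would be `callback = TtsCallback(WavFileSink("out.wav"))`. A relay or playback sink can expose the same two methods, so the callback does not need to change.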

---

## 2. `server_commit` and `commit`: who decides the segmentation

The realtime API has **two modes that differ in who segments the text**; choose according to the use case.

| | `server_commit` (server-driven) | `commit` (client-driven) |
| --- | --- | --- |
| Segmentation | the server splits the text automatically | the client marks boundaries explicitly with `commit()` |
| How to send | repeat `append_text()`, then `finish()` | `append_text()` then `commit()` per utterance |
| Best suited to | continuous synthesis of articles / long narration | **dialogue** (segmenting per utterance) |

When feeding LLM output into a voice agent, there are two options.

- **A. Send per sentence with `commit`**: while receiving the LLM's tokens, **split at sentence terminators and newlines, and call `commit()` for each sentence.** This turns the first sentence into audio soonest (minimal first_audio_delay). The extra control costs extra code.
- **B. Hand everything to `server_commit`**: keep calling `append_text()` with the tokens as they arrive, and call `finish()` at the end. The implementation is simple, but segmentation is left to the server.

When in doubt, **use A (commit) for dialogue and B (server_commit) for reading articles aloud.** In dialogue, where the fastest possible first audio is the top priority, it is worth cutting the sentence boundaries yourself (ETC, "easier to change": the segmentation logic is easy to swap later).
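
Option B can be sketched in a few lines. This assumes the session was already configured with `mode="server_commit"` and relies only on the `append_text()` and `finish()` methods described above. `TtsLike` and `speak_with_server_commit` are names introduced for this example, and the returned character count is an addition for cost logging.

```python
from typing import Iterable, Protocol


class TtsLike(Protocol):
    """The subset of the realtime client used here (structural typing)."""

    def append_text(self, text: str) -> None: ...
    def finish(self) -> None: ...


def speak_with_server_commit(tts: TtsLike, token_stream: Iterable[str]) -> int:
    """Forward tokens unchanged and let the server decide where to segment.

    Returns the number of characters sent.
    """
    sent = 0
    for token in token_stream:
        if token:                 # skip empty deltas that some LLM streams emit
            tts.append_text(token)
            sent += len(token)
    tts.finish()                  # signal that no more text will follow
    return sent
```

Because the function depends only on a two-method protocol, it can be unit-tested with a fake client that records calls, without opening a WebSocket.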

---

## 3. LLM → TTS pipeline: make the first sentence into audio fastest

The crux is "**don't wait for the LLM's full text.**" As tokens stream in, **pass them to TTS as soon as a sentence ends.** Factor sentence splitting out into a single-responsibility pure function (SRP, testability).

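```python
import re

# Split after a run of terminators, so "!?" stays attached to its sentence.
# ASCII "." is not a terminator here; add it for English text (and mind decimals like 3.14).
_SENTENCE_END = re.compile(r"(?<=[。．！？!?\n])(?![。．！？!?\n])")

def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Pure function returning (complete sentences, unfinished remainder).
    Text up to a sentence terminator or newline counts as complete;
    a trailing fragment is carried over to the next call."""
    parts = _SENTENCE_END.split(buffer)
    if not buffer.endswith(("。", "．", "！", "？", "!", "?", "\n")):
        remainder = parts.pop() if parts else ""
    else:
        remainder = ""
    return [p for p in parts if p.strip()], remainder

def speak_llm_stream(tts: "QwenTtsRealtime", token_stream) -> None:
    """Stream LLM tokens to TTS as soon as each sentence completes (commit mode)."""
    buf = ""
    for token in token_stream:        # ② streaming output from the LLM
        buf += token
        sentences, buf = split_complete_sentences(buf)
        for s in sentences:
            tts.append_text(s)
            tts.commit()              # commit per sentence so the first audio comes early
    if buf.strip():                   # trailing fragment without a terminator
        tts.append_text(buf)
        tts.commit()                  # assumption: in commit mode, uncommitted text is not synthesized
    tts.finish()
```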

> **Beware backpressure**: the LLM generates text faster than audio plays (audio advances in real time), so calling `append_text` without limit lets the internal queue grow. For long text, track how many seconds of audio have been played and throttle sending, or leave pacing to the server with `server_commit`. **Audio is bound by the physics of real time**, which is the decisive difference from a text stream.
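
One way to "watch seconds played" is to compare the duration of audio received so far with the wall-clock time since the first chunk. The sketch below is a rough estimate under two stated assumptions: playback starts when the first chunk arrives, and it is never paused. `AudioLeadTracker` and the 10-second threshold are hypothetical.

```python
import time
from typing import Callable

BYTES_PER_SECOND = 24_000 * 2  # 24 kHz, mono, 16-bit samples


class AudioLeadTracker:
    """Estimates how far synthesized audio is running ahead of playback."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at = None   # time of the first chunk; None until audio arrives
        self._audio_s = 0.0       # total seconds of audio received

    def on_audio(self, pcm: bytes) -> None:
        if self._started_at is None:
            self._started_at = self._clock()
        self._audio_s += len(pcm) / BYTES_PER_SECOND

    def lead_seconds(self) -> float:
        """Seconds of audio received but (by estimate) not yet played."""
        if self._started_at is None:
            return 0.0
        elapsed = self._clock() - self._started_at
        return max(0.0, self._audio_s - elapsed)

    def should_pause_sending(self, max_lead_s: float = 10.0) -> bool:
        return self.lead_seconds() > max_lead_s
```

A sender loop could check `should_pause_sending()` before each `append_text()` and sleep briefly while it returns `True`. For exact figures, the browser would need to report its real playback position back to the server.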

---

## 4. Browser playback: play PCM without exposing the key

**Don't call DashScope directly from the browser** (it would leak the API key). The correct architecture has three layers.

```text
Browser  ⇄  your server (Next.js)  ⇄  DashScope WS
(audio playback)   (holds key, WS relay, authorization)   (synthesis)
```

The server (Next.js / Node) accepts a WebSocket from the browser, connects internally to DashScope realtime, and **relays the audio fragments.** The browser **plays the incoming PCM gaplessly with the Web Audio API.** A player that plays 24 kHz / mono / 16-bit PCM directly can be written like this.

```ts
/** A player that plays 24kHz mono 16bit PCM chunks back to back, without gaps.
 *  It schedules each AudioBufferSourceNode at an explicit time to avoid dropouts. */
export class PcmStreamPlayer {
  private ctx: AudioContext;
  private nextStartTime = 0;
  private readonly sources = new Set<AudioBufferSourceNode>();

  constructor(private readonly sampleRate = 24_000) {
    this.ctx = new AudioContext({ sampleRate });
  }

  /** Takes one chunk of Int16 PCM (little-endian) and queues it right after the previous one. */
  enqueue(pcm: ArrayBuffer): void {
    const int16 = new Int16Array(pcm);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000; // normalize to -1..1

    const buffer = this.ctx.createBuffer(1, float32.length, this.sampleRate);
    buffer.getChannelData(0).set(float32);

    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);

    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    src.start(startAt);                       // schedule right after the previous chunk
    this.nextStartTime = startAt + buffer.duration;

    this.sources.add(src);
    src.onended = () => this.sources.delete(src);
  }

  /** Barge-in (interruption): stop the audio being played at once and drop the queue. */
  stop(): void {
    for (const s of this.sources) { try { s.stop(); } catch {} }
    this.sources.clear();
    this.nextStartTime = this.ctx.currentTime;
  }
}
```

The browser-side receive loop (plays the PCM relayed by the server):

```ts
const player = new PcmStreamPlayer();
const ws = new WebSocket(`wss://${location.host}/api/voice`); // WebSocket to your own server
ws.binaryType = "arraybuffer";

ws.onmessage = (e) => {
  if (e.data instanceof ArrayBuffer) player.enqueue(e.data); // audio PCM
  // Design control messages (text) to arrive separately as JSON
};
```
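
One detail to watch on the relay: `new Int16Array(buffer)` throws a `RangeError` when the buffer's byte length is odd. Whether the upstream ever splits a 16-bit sample across two chunks is not something this article verifies, so the following is a defensive sketch, shown in Python to match the server-side examples here. `SampleAligner` is a hypothetical helper that re-chunks the stream on sample boundaries before forwarding; the same logic ports directly to Node.

```python
class SampleAligner:
    """Re-chunks a byte stream so every emitted chunk holds whole 16-bit samples."""

    def __init__(self, bytes_per_sample: int = 2) -> None:
        self._width = bytes_per_sample
        self._carry = b""  # leftover bytes that do not yet form a full sample

    def feed(self, chunk: bytes) -> bytes:
        """Return the largest sample-aligned prefix; keep the rest for the next call."""
        data = self._carry + chunk
        usable = len(data) - (len(data) % self._width)
        self._carry = data[usable:]
        return data[:usable]
```

The relay would call `aligner.feed(pcm)` on each decoded `response.audio.delta` payload and forward the result only when it is non-empty.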

> **Start the `AudioContext` on a user gesture**: many browsers block autoplay. Call `new AudioContext()` / `ctx.resume()` in the `onClick` handler of a "Start conversation" button so that audio is enabled by a user action (both browser policy and a11y require this).

> **Always authorize the relay WS (`/api/voice`)**: this endpoint is a **billing path** backed by your DashScope API key. Before relaying, verify a session cookie or a short-lived token during the handshake. If the endpoint is unauthenticated, third parties can run TTS on your bill. Keeping the key off the browser only helps if the relay itself is protected.
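
As one possible shape for the short-lived token, the sketch below signs `user_id:expiry` with HMAC-SHA256 using only the standard library. It is an illustration under assumptions, not the author's implementation: `issue_token` and `verify_token` are hypothetical names, the secret is assumed to live only on the server, and the example is in Python even though the relay in this article is Next.js.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def issue_token(secret: bytes, user_id: str, ttl_s: int = 60,
                now: Optional[float] = None) -> str:
    """Create a token of the form base64(user_id:expiry).base64(signature)."""
    expires = int((time.time() if now is None else now) + ttl_s)
    payload = f"{user_id}:{expires}".encode()
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
        user_id, expires = payload.decode().rsplit(":", 1)
        expires_at = int(expires)
    except (ValueError, UnicodeDecodeError):
        return None  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    current = time.time() if now is None else now
    return user_id if current < expires_at else None
```

An authenticated HTTP route would issue the token, the browser would present it when opening the WebSocket, and the relay would reject the handshake when `verify_token` returns `None`.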

---

## 5. Barge-in (interruption): go silent instantly when talked to

The single most important feature that makes a dialogue feel like a "dialogue" is **barge-in.** When the user starts talking while the agent is speaking, the agent must instantly **① stop generation and ② discard the audio being played.**

1. **Detect the user's speech with VAD (voice activity detection).**
2. Discard the in-progress TTS session on the server side (stop relaying subsequent `response.audio.delta` events).
3. Flush the playback buffer in the browser with `player.stop()`.

```ts
// Browser: stop immediately on the signal that the user started speaking
function onUserStartedSpeaking() {
  player.stop();                         // discard the audio being played (leave no tail)
  ws.send(JSON.stringify({ type: "barge_in" })); // ask the server to abort generation
}
```

The crux is "**give the responsibility for stopping to the playback side.**" Even if the server's interruption lags by a moment, the user does not perceive the delay as long as the browser goes silent first. **When latency matters, stop at the layer closest to the user.**
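
For step 2 on the server, one simple way to "not relay subsequent deltas" is a generation counter: each agent response gets a number, and a barge-in bumps the counter so that late audio from the old response is dropped. This is a sketch of that idea, not an SDK feature; `BargeInGate` is a hypothetical class.

```python
import threading


class BargeInGate:
    """Tags each agent response with a generation number.

    Audio belonging to a superseded generation is dropped instead of relayed.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0

    def start_response(self) -> int:
        """Call when a new agent response begins; returns its generation."""
        with self._lock:
            self._generation += 1
            return self._generation

    def barge_in(self) -> None:
        """Call on the browser's barge_in message; invalidates in-flight audio."""
        with self._lock:
            self._generation += 1

    def should_relay(self, generation: int) -> bool:
        """True only for audio from the current, uninterrupted response."""
        with self._lock:
            return generation == self._generation
```

The relay would capture `gen = gate.start_response()` when it starts a TTS session and check `gate.should_relay(gen)` before forwarding each chunk. The lock matters because SDK callbacks and the browser-facing handler may run on different threads.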

---

## 6. Resilience, idempotency, observability

WebSockets disconnect, and the realtime API is billed. For production, build in the following.

- **Measure first_audio_delay**: read the time to first audio from the SDK's `get_first_audio_delay()` and emit it to structured logs / metrics. The official documentation notes that the first measurement is slower because it includes establishing the WS connection, so **pooling connections and keeping them warm** is standard practice.
- **Automatic reconnection (exponential backoff)**: reconnect when `on_close` fires. But **never play the same utterance twice**: assign a unique ID to each utterance and skip IDs that were already played (idempotency).
- **Timeout and fallback**: if first_audio_delay exceeds a threshold (e.g., 1.5 seconds), switch to a **canned "please wait a moment" clip**, or fall back to batch TTS (resilience).
- **Make cost visible**: billing scales with character count. Per session, record synthesized characters, estimated cost, first_audio_delay, and interruption count, and correlate them with [OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs).
- **Don't keep PII**: keep conversation text (which may contain personal information) out of logs; record metadata only.
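
The reconnection and idempotency items can be sketched without any SDK calls. The backoff below uses "full jitter" (a uniform random delay up to an exponentially growing cap), which is one common variant rather than anything the SDK prescribes. `backoff_delays` and `PlayedUtterances` are hypothetical names, and the default timings are placeholders.

```python
import random
from typing import Callable, Iterator, Set


def backoff_delays(base_s: float = 0.5, cap_s: float = 10.0, attempts: int = 5,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield 'full jitter' delays: uniform in [0, min(cap_s, base_s * 2**n)]."""
    for n in range(attempts):
        yield rng() * min(cap_s, base_s * (2 ** n))


class PlayedUtterances:
    """Remembers utterance IDs so a retry never plays the same utterance twice."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def first_time(self, utterance_id: str) -> bool:
        """Return True exactly once per ID; later calls with the same ID return False."""
        if utterance_id in self._seen:
            return False
        self._seen.add(utterance_id)
        return True
```

A reconnect loop would sleep for each yielded delay between attempts and give up (or fall back) when the iterator is exhausted. After reconnecting, it would resend only utterances for which `first_time(utterance_id)` is still `True`.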

```python
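# At session end, send measurements to structured logs (never the conversation text)
log.info("tts_session", extra={
    "session_id": tts.get_session_id(),
    "first_audio_delay_ms": tts.get_first_audio_delay(),  # unit assumed to be ms; confirm in the SDK docs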
    "chars": total_chars,
    "barge_in_count": barge_ins,
})
```
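
The timeout-and-fallback item from the list above can be built on a `threading.Event`, the same primitive `TtsCallback` already uses. This is a minimal sketch: `FirstAudioWatchdog` is a hypothetical helper, and the 1.5-second default simply mirrors the example threshold in this section.

```python
import threading


class FirstAudioWatchdog:
    """Signals when the first audio chunk did not arrive within a deadline."""

    def __init__(self) -> None:
        self._first_audio = threading.Event()

    def mark_audio(self) -> None:
        """Call from the response.audio.delta handler (safe to call repeatedly)."""
        self._first_audio.set()

    def needs_fallback(self, timeout_s: float = 1.5) -> bool:
        """Block for up to timeout_s; True means no audio arrived in time."""
        return not self._first_audio.wait(timeout_s)
```

After committing the first sentence, the caller would check `needs_fallback()` on a worker thread and, if it returns `True`, play the canned clip or switch to batch TTS. One watchdog instance is intended per response.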

---

## 7. Conclusion: a design checklist for a low-latency voice agent

- **The metric is first_audio_delay**: optimize "the first audio," not the total, and make it an SLI.
- **Pipeline**: as soon as the LLM's first sentence is complete, send it to TTS (per sentence, with `commit`).
- **3-layer configuration**: browser ⇄ your server ⇄ DashScope. Key on the server, playback in the browser.
- **Gapless playback**: time-schedule PCM 24kHz with `AudioBufferSourceNode`.
- **Barge-in**: instant stop at the playback layer closest to the user + generation interruption.
- **Resilience**: reconnection + utterance-ID idempotency, canned-audio fallback on timeout.
- **Observability**: correlate first_audio_delay, character count, cost, and interruption count in logs. Don't output PII text.

A voice agent is about designing "the pause" more than "smartness." In [the RAG-equipped voice-customer-service system](/case-studies/ai-voice-chatbot), I built in both fast responses and natural interruption. **If you want to turn your business into a no-waiting voice dialogue, I can work with you end-to-end, from design through implementation and operation.** Feel free to get in touch, starting from requirements gathering.

---

### References (official documentation)

- [Qwen-TTS real-time speech synthesis (Alibaba Cloud Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-realtime) — WebSocket protocol, SDK, events, modes
- [Qwen-TTS speech synthesis (Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) — the basics of model, voice, parameters
- [MDN: Web Audio API](https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API) — scheduled playback with `AudioBufferSourceNode`
