友田 陽大
Voice AI
Python
Speech synthesis
Generative AI
Qwen
Real-time
Architecture design

Qwen-TTS real-time voice-agent implementation guide: WebSocket streaming, browser playback, and barge-in (interruption)

A guide to building a production-grade, low-latency voice agent that "replies while talking" with Qwen3-TTS-Flash-Realtime. Using real code, it covers the bidirectional WebSocket protocol (session.created / response.audio.delta / session.finished), when to use server_commit vs. commit, streaming synthesis of LLM output, gapless playback of 24kHz PCM in the browser, barge-in (interruption), connection resilience, and measuring first_audio_delay.

Author
友田 陽大

What decides a voice agent's quality is less its intelligence than its timing, the pause (ma). However good the answer, if 2 seconds of silence pass between the user's question and the first audio, the agent feels "broken." Conversely, if the first word comes back within 0.5 seconds and the agent falls silent the instant the user starts talking, the exchange suddenly feels like a dialogue.

This article is a guide to building a production-grade, low-latency voice agent with Qwen3-TTS-Flash-Realtime from Alibaba Cloud (Qwen). The basics (models, voices, pricing) are left to the Qwen-TTS production-operations guide; this article concentrates on designing "reply-while-talking" streaming. The context is the RAG customer-service system I actually built, described in the generative-AI voice chatbot article.

Ground rules for this article: the API specifications follow the Alibaba Cloud Model Studio documentation for Qwen-TTS real-time speech synthesis (as of June 2026). Method and event names are written as in the official SDK, but the SDK changes over time, so check the latest types in the official documentation before going to production. The API key is assumed to come from an environment variable (never hardcode it, and never expose it to the browser).


0. The big picture: the voice-agent loop and the latency budget

A real-time voice agent is a chain of 3 stages.

user utterance → ① STT (transcription) → ② LLM (response generation) → ③ TTS (speech synthesis) → playback
       ↑__________________________ barge-in (stop on interruption)__________________________|

What shapes perception is not the total latency but the time until the first audio. That gives 2 iron rules of design.

  1. Pipeline the stages: don't wait for the LLM to finish generating; send text to TTS the moment the first sentence is complete.
  2. Treat first_audio_delay as a measured target: make the time to the first audio a production SLI (metric) and monitor it for regressions (chapter 6).
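
As a rough illustration of rule 1, the sketch below compares the time to the first audio for a sequential chain and a pipelined one. Every timing in it is a made-up illustrative number, not a measurement, and the names StageTimings and time_to_first_audio are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimings:
    """Illustrative per-stage latencies in milliseconds (assumed, not measured)."""
    stt_ms: int                 # user stops talking -> transcript ready
    llm_first_sentence_ms: int  # LLM request -> first complete sentence
    llm_full_ms: int            # LLM request -> full response
    tts_first_audio_ms: int     # text submitted -> first audio chunk


def time_to_first_audio(t: StageTimings, pipelined: bool) -> int:
    """Milliseconds from the end of user speech to the first audio chunk."""
    # Pipelined: TTS starts after the first sentence. Sequential: after the full text.
    llm_wait = t.llm_first_sentence_ms if pipelined else t.llm_full_ms
    return t.stt_ms + llm_wait + t.tts_first_audio_ms


timings = StageTimings(stt_ms=200, llm_first_sentence_ms=300,
                       llm_full_ms=1800, tts_first_audio_ms=200)
sequential_ms = time_to_first_audio(timings, pipelined=False)  # 200 + 1800 + 200
pipelined_ms = time_to_first_audio(timings, pipelined=True)    # 200 + 300 + 200
print(f"sequential: {sequential_ms} ms, pipelined: {pipelined_ms} ms")
```

With these assumed numbers the total work is unchanged, yet the first audio arrives well under a second only in the pipelined case.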

"The first is fast" beats "the total is fast." Batch narration generation and dialogue (real-time) are different things. If batch suffices, not opening a WebSocket is simpler and cheaper (KISS). This article is about dialogue.


1. The basics of the realtime API: WebSocket and events

Qwen-TTS realtime runs on a full-duplex WebSocket.

  • Models: qwen3-tts-flash-realtime / qwen3-tts-instruct-flash-realtime
  • Endpoint (international): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime (mainland China is dashscope.aliyuncs.com)
  • Audio arrives incrementally via the response.audio.delta event as base64-encoded PCM (24kHz/mono/16bit)
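
Given the format above, each chunk's playback length follows directly from its byte count. The sketch below decodes one delta payload and computes its duration; the event dictionary is fabricated for illustration, and the helper names are hypothetical.

```python
import base64

SAMPLE_RATE_HZ = 24_000   # from the format stated above
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # mono


def decode_delta(event: dict) -> bytes:
    """Decode the base64 payload of a response.audio.delta event into raw PCM."""
    return base64.b64decode(event["delta"])


def pcm_duration_seconds(pcm: bytes) -> float:
    """Playback duration of a raw 24 kHz / mono / 16-bit PCM chunk."""
    return len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# A fabricated event carrying 4800 bytes of silence, for illustration only.
fake_event = {
    "type": "response.audio.delta",
    "delta": base64.b64encode(b"\x00" * 4800).decode("ascii"),
}
pcm = decode_delta(fake_event)
print(len(pcm), pcm_duration_seconds(pcm))
```

One second of audio in this format is 48,000 bytes, which is a handy figure when sizing buffers or estimating how much audio is still queued.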

The official SDK delivers server events through callbacks. A minimal implementation looks like this.

import os
import base64
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import (
    AudioFormat,
    QwenTtsRealtime,
    QwenTtsRealtimeCallback,
)

class TtsCallback(QwenTtsRealtimeCallback):
    """Handles events arriving from the server. Audio fragments flow through here."""
    def __init__(self, sink) -> None:
        self._done = threading.Event()
        self._sink = sink  # destination for received PCM (file / playback / WS relay)

    def on_open(self) -> None:
        # Connection established. Initialize the player and similar here.
        pass

    def on_close(self, code, msg) -> None:
        self._sink.close()
        self._done.set()   # unblock wait() even if the connection drops before session.finished

    def on_event(self, response: dict) -> None:
        etype = response["type"]
        if etype == "session.created":
            pass  # the session ID is available as response["session"]["id"]
        elif etype == "response.audio.delta":
            pcm = base64.b64decode(response["delta"])  # 24kHz mono 16bit
            self._sink.write(pcm)                       # pass downstream immediately (don't buffer)
        elif etype == "response.done":
            pass  # synthesis of one response is complete
        elif etype == "session.finished":
            self._done.set()                            # session ended

    def wait(self) -> None:
        self._done.wait()

Connection and session setup:

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
dashscope.base_websocket_api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

# sink is any object with write(bytes) and close() (file, player, WS relay).
callback = TtsCallback(sink)   # keep the reference yourself (wait() is called on it later)
tts = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
tts.connect()
tts.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode="server_commit",   # compared with commit in chapter 2
)
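
The sink handed to TtsCallback is not defined in the article; all the callback needs is an object with write(bytes) and close(). One possible stand-in for local testing is a sink that stores the stream as a WAV file, sketched below with only the standard library. WavFileSink is a hypothetical name, not part of the SDK.

```python
import wave


class WavFileSink:
    """Hypothetical sink: collects raw PCM chunks into a WAV file for offline listening."""

    def __init__(self, path: str, sample_rate: int = 24_000) -> None:
        self._wav = wave.open(path, "wb")
        self._wav.setnchannels(1)           # mono
        self._wav.setsampwidth(2)           # 16-bit samples
        self._wav.setframerate(sample_rate)
        self._closed = False

    def write(self, pcm: bytes) -> None:
        """Append one PCM chunk; chunks arriving after close() are ignored."""
        if not self._closed:
            self._wav.writeframes(pcm)

    def close(self) -> None:
        """Finalize the WAV header. Safe to call more than once."""
        if not self._closed:
            self._closed = True
            self._wav.close()
```

Under that assumption, sink = WavFileSink("reply.wav") can be created before TtsCallback(sink), and the file becomes playable once the connection closes.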

How text is sent depends on the mode (next chapter). The basic pattern is to feed text in with append_text() and then finalize with finish() (or commit()).

tts.append_text("Thank you for contacting us today.")
tts.finish()       # server_commit: signal that there is no more text
callback.wait()    # wait until session.finished (uses the reference kept earlier)

2. server_commit and commit: who decides the segmentation

The realtime API has 2 modes that differ in how text is segmented; choose between them by use case.

             | server_commit (server-driven)                     | commit (client-driven)
Segmentation | the server smartly splits the text                | the client makes it explicit with commit()
How to send  | repeat append_text(), then finish()               | append_text(), then commit(), per utterance
Suited use   | continuous synthesis of articles / long narration | dialogue (want to segment per utterance)

When feeding LLM output into a voice agent, there are 2 options.

  • A. Send per sentence with commit: while receiving the LLM's tokens, split at periods and newlines and call commit() for each sentence. This turns the first sentence into audio soonest (minimal first_audio_delay). The price of the extra control is more code.
  • B. Hand everything to server_commit: keep calling append_text() with the tokens as they arrive, then finish() at the end. The implementation is simple, but segmentation is left to the server.

When in doubt, use A (commit) for dialogue and B (server_commit) for reading articles aloud. In dialogue, where getting the first audio out fastest is the top priority, it is worth cutting the sentence boundaries yourself (ETC, easier to change: the segmentation logic is easy to swap later).
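
The two calling patterns can be put side by side in a small sketch. It assumes only the three method names used in this article (append_text, commit, finish) and exercises them against a recording test double rather than a live connection; send_segments, TextSender, and RecordingSender are hypothetical helpers.

```python
from typing import Iterable, List, Protocol


class TextSender(Protocol):
    """The subset of the realtime client used here (method names as in this article)."""
    def append_text(self, text: str) -> None: ...
    def commit(self) -> None: ...
    def finish(self) -> None: ...


def send_segments(tts: TextSender, segments: Iterable[str], mode: str) -> None:
    """Send text segments using the calling pattern of the chosen mode."""
    if mode not in ("commit", "server_commit"):
        raise ValueError(f"unknown mode: {mode!r}")
    for segment in segments:
        tts.append_text(segment)
        if mode == "commit":
            tts.commit()   # option A: the client marks each boundary explicitly
    tts.finish()           # both modes: no more text will follow


class RecordingSender:
    """Test double that records calls instead of opening a WebSocket."""
    def __init__(self) -> None:
        self.calls: List[str] = []

    def append_text(self, text: str) -> None:
        self.calls.append(f"append:{text}")

    def commit(self) -> None:
        self.calls.append("commit")

    def finish(self) -> None:
        self.calls.append("finish")


fake = RecordingSender()
send_segments(fake, ["Hello.", "How can I help?"], mode="commit")
print(fake.calls)
```

Keeping the mode decision in one function is one way to honor the ETC point: switching a flow from A to B becomes a one-argument change.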


3. LLM → TTS pipeline: make the first sentence into audio fastest

The crux is not waiting for the LLM's full text. As tokens arrive, pass them to TTS as soon as a sentence boundary appears. Factor sentence splitting out into a single-responsibility pure function (SRP, testability).

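import re

# Split right after a sentence terminator. The full-width "！" and "？" are an assumed
# restoration: the published character class listed "!?" twice.
# Caveat: "." also splits decimals and abbreviations such as "3.14"; refine as needed.
_TERMINATORS = ("。", ".", "!", "?", "！", "？", "\n")
_SENTENCE_END = re.compile(r"(?<=[。.!?！？\n])")

def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Pure function: return the list of completed sentences and the unfinished remainder.
    Text up to a terminator or newline is "complete"; a trailing fragment is carried over."""
    parts = _SENTENCE_END.split(buffer)   # splitting on an empty match needs Python 3.7+
    if not buffer.endswith(_TERMINATORS):
        remainder = parts.pop() if parts else ""
    else:
        remainder = ""
    return [p for p in parts if p.strip()], remainder

def speak_llm_stream(tts: "QwenTtsRealtime", token_stream) -> None:
    """Feed the LLM token stream to TTS as soon as each sentence is ready (commit mode)."""
    buf = ""
    for token in token_stream:        # (2) streaming output from the LLM
        buf += token
        sentences, buf = split_complete_sentences(buf)
        for s in sentences:
            tts.append_text(s)
            tts.commit()              # commit per sentence -> the first audio comes out early
    if buf.strip():                   # trailing remainder
        tts.append_text(buf)
        tts.commit()                  # assumed necessary in commit mode so the tail is synthesized
    tts.finish()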

Beware of backpressure: the LLM generates faster than audio plays (audio advances in real time), so calling append_text() without limit swells the internal queue. For long text, track the seconds already played and throttle sending, or leave it to the server with server_commit. Audio is bound by the physics of real time; that is the decisive difference from a text stream.
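
One way to track "seconds played" is to estimate the unplayed backlog from the bytes received and the wall clock. The sketch below is a simplification under a loud assumption: playback starts when the first chunk arrives and then runs continuously in real time. A real player should report its actual position instead. PlaybackPacer and its 5-second default are hypothetical.

```python
import time
from typing import Callable, Optional


class PlaybackPacer:
    """Estimates how much received audio has not been played yet.

    Assumption: playback starts at the first chunk and runs without pauses.
    """

    def __init__(self, max_buffered_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_buffered_s
        self._clock = clock                       # injectable for tests
        self._started_at: Optional[float] = None
        self._received_s = 0.0

    def on_audio(self, pcm: bytes, sample_rate: int = 24_000) -> None:
        """Call for every received chunk of 16-bit mono PCM."""
        if self._started_at is None:
            self._started_at = self._clock()
        self._received_s += len(pcm) / (sample_rate * 2)

    def buffered_seconds(self) -> float:
        """Seconds of audio received but (by the estimate) not yet played."""
        if self._started_at is None:
            return 0.0
        played = self._clock() - self._started_at
        return max(0.0, self._received_s - played)

    def can_send_more(self) -> bool:
        """True while the estimated backlog is below the limit."""
        return self.buffered_seconds() < self._max
```

A sender could check can_send_more() before each append_text() and sleep briefly while it returns False, which bounds the queue without touching the TTS client itself.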


4. Browser playback: play PCM without exposing the key

Don't call DashScope directly from the browser (the API key would leak). The right setup has 3 layers.

Browser  ⇄  your server (Next.js)  ⇄  DashScope WS
(audio playback)   (holds key, WS relay, authorization)   (synthesis)

The server (Next.js / Node) opens a WebSocket with the browser, internally connects to DashScope realtime, and relays the audio fragments. The browser plays the arrived PCM gaplessly with the Web Audio API. A player that plays 24kHz/mono/16bit PCM as-is can be written like this.

/** Player that plays 24kHz mono 16-bit PCM chunks in order, without gaps.
 *  Schedules each AudioBufferSourceNode at an explicit time to prevent dropouts. */
export class PcmStreamPlayer {
  private ctx: AudioContext;
  private nextStartTime = 0;
  private readonly sources = new Set<AudioBufferSourceNode>();

  constructor(private readonly sampleRate = 24_000) {
    this.ctx = new AudioContext({ sampleRate });
  }

  /** Call from a user gesture (e.g. a click) so the autoplay policy allows audio. */
  resume(): Promise<void> {
    return this.ctx.resume();
  }

  /** Takes one chunk of Int16 PCM (little-endian) and plays it right after the previous one. */
  enqueue(pcm: ArrayBuffer): void {
    // Int16Array needs whole samples and uses the platform byte order (little-endian
    // in practice); a stray trailing byte is dropped here.
    const int16 = new Int16Array(pcm, 0, Math.floor(pcm.byteLength / 2));
    if (int16.length === 0) return;           // createBuffer() rejects a zero-length buffer
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000; // normalize to -1..1

    const buffer = this.ctx.createBuffer(1, float32.length, this.sampleRate);
    buffer.getChannelData(0).set(float32);

    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);

    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    src.start(startAt);                       // schedule right after the previous chunk
    this.nextStartTime = startAt + buffer.duration;

    this.sources.add(src);
    src.onended = () => this.sources.delete(src);
  }

  /** Barge-in (interruption): stop the playing audio at once and discard the queue. */
  stop(): void {
    for (const s of this.sources) { try { s.stop(); } catch {} }
    this.sources.clear();
    this.nextStartTime = this.ctx.currentTime;
  }
}

The browser-side receive loop (play the PCM the server relayed):

const player = new PcmStreamPlayer();
const ws = new WebSocket(`wss://${location.host}/api/voice`); // WS on your own server
ws.binaryType = "arraybuffer";

ws.onmessage = (e) => {
  if (e.data instanceof ArrayBuffer) player.enqueue(e.data); // audio PCM
  // design control messages (text) to arrive separately as JSON
};

Start the AudioContext on a user gesture: many browsers block autoplay. Create the AudioContext (new AudioContext()) or call ctx.resume() in the onClick handler of a "Start conversation" button, so that audio is enabled by a user action (both accessibility (a11y) and browser policy require this).

Always authorize the relay WS (/api/voice): this endpoint is a billing path that holds the DashScope API key. Before relaying, verify a session cookie or a short-lived token at the handshake stage. Without authentication, a third party can run TTS on your bill (keeping the key out of the browser only helps if the relay itself is protected by authorization).
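
A short-lived token can be as simple as an HMAC-signed user ID with an expiry. The sketch below uses only the standard library and is written in Python to match the other examples, although the article's relay runs on Next.js; the same logic ports directly. The token layout, the function names, and the 60-second lifetime are assumptions for illustration, not a prescribed scheme.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _sign(secret: bytes, payload: str) -> str:
    """URL-safe base64 of HMAC-SHA256(secret, payload)."""
    digest = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def issue_token(secret: bytes, user_id: str, ttl_s: int = 60,
                now: Optional[float] = None) -> str:
    """Issue 'user_id:expires_at:signature' from an already authenticated HTTP route."""
    expires_at = int((time.time() if now is None else now) + ttl_s)
    payload = f"{user_id}:{expires_at}"
    return f"{payload}:{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user ID if the token is authentic and unexpired, else None."""
    try:
        payload, signature = token.rsplit(":", 1)
        user_id, expires_raw = payload.rsplit(":", 1)
        expires_at = int(expires_raw)
    except ValueError:
        return None                                   # malformed token
    expected = _sign(secret, payload)
    # Constant-time comparison, on bytes so non-ASCII input cannot raise.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    current = time.time() if now is None else now
    return user_id if current < expires_at else None
```

The relay would call verify_token() during the WebSocket handshake and refuse the upgrade when it returns None. The signing secret is assumed to live in a server-side environment variable, like the DashScope key.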


5. Barge-in (interruption): go silent instantly when talked to

The single most important feature that makes a dialogue feel like a dialogue is barge-in. When the user starts talking while the agent is speaking, (1) stop generation and (2) instantly discard the audio being played.

  1. When you detect the user's utterance with VAD (voice activity detection),
  2. discard the in-progress TTS session on the server side (don't relay subsequent response.audio.delta),
  3. flush the playback buffer with the browser's player.stop().

// Browser: stop immediately on the signal that the user started speaking
function onUserStartedSpeaking() {
  player.stop();                         // discard the audio being played (leave no tail)
  ws.send(JSON.stringify({ type: "barge_in" })); // ask the server to interrupt generation
}

The crux is to give the playback side the responsibility for stopping. Even if the server-side interruption lags by a moment, the user still perceives an immediate stop as long as the browser goes silent first. Where latency matters, stopping at the layer closest to the user is the right answer.
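
On the server side, step 2 (don't relay subsequent response.audio.delta) can be modeled as a small gate that forwards audio only for the utterance currently allowed to play. The sketch below is one possible shape; BargeInGate and the idea of tagging each chunk with an utterance ID are assumptions of this example, not part of the SDK.

```python
import threading
from typing import Callable, Optional


class BargeInGate:
    """Forwards audio only for the utterance that is currently allowed to play."""

    def __init__(self, forward: Callable[[bytes], None]) -> None:
        self._forward = forward            # e.g. the send function of the browser-facing WS
        self._lock = threading.Lock()
        self._active_id: Optional[str] = None

    def start_utterance(self, utterance_id: str) -> None:
        """Mark a new utterance as the only one whose audio may be relayed."""
        with self._lock:
            self._active_id = utterance_id

    def barge_in(self) -> None:
        """Call when the browser sends {"type": "barge_in"}."""
        with self._lock:
            self._active_id = None

    def on_audio(self, utterance_id: str, pcm: bytes) -> bool:
        """Relay a chunk if its utterance is still active; return whether it was sent."""
        with self._lock:
            if self._active_id != utterance_id:
                return False               # interrupted or stale: drop silently
            self._forward(pcm)             # sent under the lock so a barge-in cannot interleave
            return True
```

Chunks that arrive late from an interrupted session are dropped here, which keeps the browser quiet even before the upstream session has been torn down.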


6. Resilience, idempotency, observability

WebSockets drop, and the realtime API is billed by usage. For production, build in the following.

  • Measure first_audio_delay: get the time to the first audio from the SDK's get_first_audio_delay() and emit it to structured logs / metrics. The official documentation explicitly notes that the first request is slow because it includes establishing the WS connection, so pooling connections and keeping them warm is standard practice.
  • Automatic reconnection (exponential backoff): reconnect when on_close fires. But never play the same utterance twice: assign each utterance a unique ID and skip IDs that were already played (idempotency).
  • Timeout and fallback: if first_audio_delay exceeds a threshold (e.g., 1.5 seconds), switch to a canned "please wait a moment" audio clip, or fall back to batch TTS (resilience).
  • Make cost visible: billing scales with the number of characters synthesized. For each session, record synthesized characters, estimated cost, first_audio_delay, and interruption count, and correlate them with OpenTelemetry.
  • Keep PII out of logs: don't log conversation text (which may contain personal information); log metadata only.

# At session end, write the measurements to a structured log (never the text itself).
# log is a logging.Logger; total_chars and barge_ins are counters your own code maintains.
log.info("tts_session", extra={
    "session_id": tts.get_session_id(),
    "first_audio_delay_ms": tts.get_first_audio_delay(),  # unit assumed to be milliseconds
    "chars": total_chars,
    "barge_in_count": barge_ins,
})
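
The reconnection and idempotency bullets can be sketched together. Below, backoff_delays yields capped exponential delays with full jitter, and PlayedUtterances remembers which utterance IDs were already played so that a retry skips them. Both names and all default values are assumptions for illustration.

```python
import random
from typing import Iterator, Optional, Set


def backoff_delays(base_s: float = 0.5, cap_s: float = 30.0,
                   max_attempts: int = 6,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield reconnect delays: exponential growth, capped, with full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)    # jitter avoids synchronized reconnect storms


class PlayedUtterances:
    """Remembers utterance IDs that were already played so a retry can skip them."""

    def __init__(self) -> None:
        self._played: Set[str] = set()

    def should_play(self, utterance_id: str) -> bool:
        """Return True exactly once per ID; later calls with the same ID return False."""
        if utterance_id in self._played:
            return False
        self._played.add(utterance_id)
        return True
```

A reconnect loop would sleep for each yielded delay before calling connect() again and give up (or fall back to canned audio) once the generator is exhausted. After reconnecting, should_play() decides whether an utterance still needs to be synthesized.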

7. Conclusion: a design checklist for a low-latency voice agent

  • The metric is first_audio_delay: optimize "the first audio," not the total, and make it an SLI.
  • Pipeline: as soon as the LLM's first sentence is done, send it to TTS (per sentence, with commit).
  • 3-layer configuration: browser ⇄ your server ⇄ DashScope. Key on the server, playback in the browser.
  • Gapless playback: time-schedule PCM 24kHz with AudioBufferSourceNode.
  • Barge-in: instant stop at the playback layer closest to the user + generation interruption.
  • Resilience: reconnection + utterance-ID idempotency, canned-audio fallback on timeout.
  • Observability: correlate first_audio_delay, character count, cost, and interruption count in logs. Don't output PII text.

A voice agent is about designing "the pause" more than about "smartness." In my RAG-equipped voice customer-service system, I built in both fast responses and natural interruption. If you want to turn your business into a no-wait voice dialogue, I can work alongside you end to end, from design through implementation and operation. Feel free to get in touch, even at the requirements stage.

