What decides a voice agent's quality is less its intelligence than the pause (ma). However good the answer, if 2 seconds of silence pass between the user's question and the first audio, the agent feels "broken." Conversely, if the first word comes back within 0.5 seconds and the agent falls silent the instant the user starts talking, the experience becomes a dialogue.
This article is a guide to implementing a low-latency voice agent in production with Qwen3-TTS-Flash-Realtime from Alibaba Cloud (Qwen). The basics (models, voices, pricing) are left to the Qwen-TTS production-operations guide; this article concentrates on designing "reply-while-talking" streaming. The RAG customer-service context comes from the generative-AI voice chatbot I actually built.
Rules for this article: the API specifications are based on the Alibaba Cloud Model Studio documentation for Qwen-TTS real-time speech synthesis (as of June 2026). Method and event names follow the official SDK, but the SDK changes over time, so confirm the latest types in the official documentation before going to production. The API key is assumed to come from an environment variable (never hardcode it, and never expose it to the browser).
0. The big picture: the voice-agent loop and the latency budget
A real-time voice agent is a chain of 3 stages.
user utterance → ① STT (transcription) → ② LLM (response generation) → ③ TTS (speech synthesis) → playback
↑__________________________ barge-in (stop on interruption)__________________________|
- ① STT: Whisper / gpt-4o-transcribe streaming.
- ② LLM: response streaming, covered in the Vercel AI SDK production guide.
- ③ TTS: the subject of this article, handled by Qwen-TTS realtime.
What decides perception is not the total latency but the time until the first audio. So there are 2 iron rules of design.
- Pipeline the stages: don't wait for the LLM's full generation; flow to TTS the moment the first sentence is done.
- Make first_audio_delay a measurement target: make "the time to the first audio" a production SLI (metric) and monitor regressions (chapter 6).
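To make the two rules concrete, here is a small arithmetic sketch of the time to first audio with and without pipelining. Every timing below is a hypothetical placeholder chosen for illustration, not a measurement of any real service, and the names (`StageTimings` and the two functions) are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimings:
    """Hypothetical per-stage timings in milliseconds (illustrative only)."""
    stt_final_ms: int            # end of user speech -> final transcript
    llm_first_sentence_ms: int   # prompt sent -> first complete sentence
    llm_full_response_ms: int    # prompt sent -> last token
    tts_first_audio_ms: int      # text committed -> first audio chunk


def first_audio_sequential(t: StageTimings) -> int:
    """Wait for the whole LLM response before starting TTS."""
    return t.stt_final_ms + t.llm_full_response_ms + t.tts_first_audio_ms


def first_audio_pipelined(t: StageTimings) -> int:
    """Start TTS as soon as the first sentence is complete."""
    return t.stt_final_ms + t.llm_first_sentence_ms + t.tts_first_audio_ms


# Assumed numbers, not benchmarks
timings = StageTimings(
    stt_final_ms=200,
    llm_first_sentence_ms=300,
    llm_full_response_ms=1800,
    tts_first_audio_ms=150,
)
print(first_audio_sequential(timings))  # 2150
print(first_audio_pipelined(timings))   # 650
```

With these assumed numbers the total generation time is identical in both cases; only the moment the first audio can start changes, which is exactly what the SLI should track.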
"The first is fast" beats "the total is fast." Batch narration generation and dialogue (real-time) are different things. If batch suffices, not opening a WebSocket is simpler and cheaper (KISS). This article is about dialogue.
1. The basics of the realtime API: WebSocket and events
Qwen-TTS realtime runs on a full-duplex WebSocket.
- Models: `qwen3-tts-flash-realtime` / `qwen3-tts-instruct-flash-realtime`
- Endpoint (international): `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime` (for mainland China, the host is `dashscope.aliyuncs.com`)
- Audio arrives incrementally via the `response.audio.delta` event as base64-encoded PCM (24kHz/mono/16bit)
The official SDK receives server events with callbacks. The minimal implementation is like this.
```python
import os
import base64
import threading
import dashscope
from dashscope.audio.qwen_tts_realtime import *


class TtsCallback(QwenTtsRealtimeCallback):
    """Handles events arriving from the server. Audio fragments flow through here."""

    def __init__(self, sink) -> None:
        self._done = threading.Event()
        self._sink = sink  # destination for received PCM (file / playback / WS relay)

    def on_open(self) -> None:
        # Connection established. Initialize the player etc. here.
        pass

    def on_close(self, code, msg) -> None:
        self._sink.close()
        self._done.set()  # also release wait() if the connection drops before session.finished

    def on_event(self, response: dict) -> None:
        etype = response["type"]
        if etype == "session.created":
            pass  # the session ID is available as response["session"]["id"]
        elif etype == "response.audio.delta":
            pcm = base64.b64decode(response["delta"])  # 24kHz mono 16bit
            self._sink.write(pcm)  # pass downstream immediately (do not accumulate)
        elif etype == "response.done":
            pass  # synthesis for one response is complete
        elif etype == "session.finished":
            self._done.set()  # session ended

    def wait(self) -> None:
        self._done.wait()
```
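The `sink` handed to `TtsCallback` is not defined in this article; the callback only needs an object with `write(bytes)` and `close()`. As one possible sketch, a sink that writes the stream to a WAV file with Python's standard `wave` module could look like the following. The class name `WavFileSink` and the file path are hypothetical, and the 24kHz/mono/16bit parameters assume the PCM format described above.

```python
import wave


class WavFileSink:
    """Hypothetical sink: writes 24 kHz / mono / 16-bit PCM chunks to a WAV file."""

    def __init__(self, path: str, sample_rate: int = 24_000) -> None:
        self._wav = wave.open(path, "wb")
        self._wav.setnchannels(1)        # mono
        self._wav.setsampwidth(2)        # 16 bit = 2 bytes per sample
        self._wav.setframerate(sample_rate)
        self._closed = False

    def write(self, pcm: bytes) -> None:
        """Append one raw PCM chunk; chunks arriving after close() are ignored."""
        if not self._closed:
            self._wav.writeframes(pcm)

    def close(self) -> None:
        """Finalize the WAV header. Safe to call more than once."""
        if not self._closed:
            self._closed = True
            self._wav.close()
```

A file sink like this is handy for offline checks of voice and format; for dialogue, the same two-method interface can instead forward each chunk to the browser relay.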
Connection and session setup:
```python
dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
dashscope.base_websocket_api_url = "wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime"

# sink: any object with write(bytes) and close()
callback = TtsCallback(sink)  # keep your own reference (you call wait() on it later)
tts = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
tts.connect()
tts.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,
    mode="server_commit",  # compared with commit in chapter 2
)
```
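How you send text depends on the mode (next chapter). The basic pattern is to feed text in with append_text() and then finalize with finish() (or commit()).

```python
# Sample text: "Thank you for contacting us today."
tts.append_text("本日はお問い合わせありがとうございます。")
tts.finish()     # server_commit: signal that no more text is coming
callback.wait()  # wait until session.finished (uses the reference kept above)
```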
2. server_commit and commit: who decides the segmentation
realtime has 2 modes that differ in how text is segmented, and you choose by use.
| | server_commit (server-driven) | commit (client-driven) |
|---|---|---|
| Segmentation | the server smartly splits the text | the client makes it explicit with commit() |
| How to send | repeat append_text() → finish() | append_text() → commit() per utterance |
| Suited use | continuous synthesis of articles / long narration | dialogue (want to segment per utterance) |
When feeding LLM output into a voice agent, there are 2 options.
- A. Send per sentence with `commit`: while receiving the LLM's tokens, split at periods/newlines and call `commit()` for each sentence. This turns the first sentence into audio fastest (minimal first_audio_delay). The price of the extra control is more code.
- B. Hand everything to `server_commit`: keep calling `append_text()` with the tokens as they arrive, then call `finish()` at the end. The implementation is simple, but segmentation is left to the server.

When in doubt: dialogue is A (`commit`), article reading is B (`server_commit`). In dialogue, where the fastest first audio is the top priority, it is worth cutting the sentence boundaries yourself (ETC: the segmentation logic is easy to swap later).
3. LLM → TTS pipeline: make the first sentence into audio fastest
The crux is "don't wait for the LLM's full text." When tokens flow in, pass them to TTS immediately at sentence breaks. Carve sentence splitting into a single-responsibility pure function (SRP, testability).
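```python
import re

# Split right after a sentence terminator (half-width or full-width) or a newline
_SENTENCE_END = re.compile(r"(?<=[。.!?！？\n])")


def split_complete_sentences(buffer: str) -> tuple[list[str], str]:
    """Pure function returning the completed sentences and the unconfirmed remainder.

    Text up to a terminator or newline counts as a completed sentence;
    a trailing fragment is carried over to the next call."""
    parts = _SENTENCE_END.split(buffer)
    if not buffer.endswith(("。", ".", "!", "?", "！", "？", "\n")):
        remainder = parts.pop() if parts else ""
    else:
        remainder = ""
    return [p for p in parts if p.strip()], remainder


def speak_llm_stream(tts: "QwenTtsRealtime", token_stream) -> None:
    """Feed an LLM token stream to TTS sentence by sentence (session in commit mode)."""
    buf = ""
    for token in token_stream:  # streaming output from stage ② (LLM)
        buf += token
        sentences, buf = split_complete_sentences(buf)
        for s in sentences:
            tts.append_text(s)
            tts.commit()  # confirm per sentence -> the first audio comes out early
    if buf.strip():  # leftover tail
        tts.append_text(buf)
        tts.commit()  # commit the tail too: in commit mode the client confirms each segment
    tts.finish()
```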
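One caveat of splitting after every ASCII period is that decimals such as 3.14 get cut in the middle. A possible variation (a stricter sketch with a hypothetical name, `split_streaming`, not the splitter used elsewhere in this article) treats an ASCII `.` as a sentence end only when whitespace follows it, and otherwise keeps it in the remainder until more tokens arrive.

```python
import re

# Sentence ends: Japanese/full-width terminators, "!", "?", newline,
# or an ASCII "." that is followed by whitespace (so "3.14" is not split).
_BOUNDARY = re.compile(r"(?<=[。!?！？\n])|(?<=\.)(?=\s)")


def split_streaming(buffer: str) -> tuple[list[str], str]:
    """Return (completed sentences, unconfirmed remainder) for a streaming buffer."""
    parts = _BOUNDARY.split(buffer)
    remainder = parts.pop()  # the last piece is always unconfirmed (possibly empty)
    return [p for p in parts if p.strip()], remainder


print(split_streaming("価格は3.14ドルです。次"))  # (['価格は3.14ドルです。'], '次')
print(split_streaming("Hi there. How"))          # (['Hi there.'], ' How')
```

The trade-off is that a buffer ending in an ASCII period (for example "Version 2.") stays in the remainder until the next token arrives or the stream ends, so the caller still has to flush the tail as `speak_llm_stream` does.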
Beware backpressure: playback is slower than LLM generation (audio advances in real time), so calling `append_text` without limit swells the internal queue. For long text, track the seconds already played and throttle sending, or leave pacing to the server with `server_commit`. Audio is bound by the physics of real time, which is the decisive difference from a text stream.
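The "watch seconds played and throttle" idea can be sketched as a small clock that estimates how much audio is still queued ahead of playback. This is a server-side estimate under stated assumptions: playback starts as soon as audio arrives and is never paused, and the format is 24kHz/mono/16bit. The names `PlaybackClock` and `should_send_more`, and the 5-second limit, are hypothetical.

```python
import time

BYTES_PER_SECOND = 24_000 * 2  # 24 kHz x 2 bytes per sample (16-bit mono)


class PlaybackClock:
    """Estimates how many seconds of audio are queued ahead of real-time playback."""

    def __init__(self, now=time.monotonic) -> None:
        self._now = now
        self._play_until = now()  # moment at which the queued audio would finish

    def on_audio(self, pcm: bytes) -> None:
        """Account for one received PCM chunk, appended after what is already queued."""
        start = max(self._now(), self._play_until)
        self._play_until = start + len(pcm) / BYTES_PER_SECOND

    def buffered_seconds(self) -> float:
        return max(0.0, self._play_until - self._now())


def should_send_more(clock: PlaybackClock, max_ahead_s: float = 5.0) -> bool:
    """Hypothetical policy: hold the next sentence while too much audio is queued."""
    return clock.buffered_seconds() < max_ahead_s
```

In a per-sentence loop, one option is to check `should_send_more()` before each `append_text()` and wait briefly when it returns False, so text is never committed much faster than it can be spoken.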
4. Browser playback: play PCM without exposing the key
Don't hit DashScope directly from the browser (API-key leak). The correct configuration is 3 layers.
Browser ⇄ your server (Next.js) ⇄ DashScope WS
(audio playback) (holds key, WS relay, authorization) (synthesis)
The server (Next.js / Node) opens a WebSocket with the browser, internally connects to DashScope realtime, and relays the audio fragments. The browser plays the arrived PCM gaplessly with the Web Audio API. A player that plays 24kHz/mono/16bit PCM as-is can be written like this.
```ts
/** Player that plays 24kHz mono 16bit PCM chunks in order, without gaps.
 *  Schedules each AudioBufferSourceNode at an explicit time to prevent dropouts. */
export class PcmStreamPlayer {
  private ctx: AudioContext;
  private nextStartTime = 0;
  private readonly sources = new Set<AudioBufferSourceNode>();

  constructor(private readonly sampleRate = 24_000) {
    this.ctx = new AudioContext({ sampleRate });
  }

  /** Takes one chunk of Int16 PCM (little-endian) and appends it to the end of playback. */
  enqueue(pcm: ArrayBuffer): void {
    const int16 = new Int16Array(pcm);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 0x8000; // normalize to -1..1
    const buffer = this.ctx.createBuffer(1, float32.length, this.sampleRate);
    buffer.getChannelData(0).set(float32);
    const src = this.ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(this.ctx.destination);
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    src.start(startAt); // schedule right after the previous chunk
    this.nextStartTime = startAt + buffer.duration;
    this.sources.add(src);
    src.onended = () => this.sources.delete(src);
  }

  /** Barge-in (interruption): immediately stop the audio in play and discard the queue. */
  stop(): void {
    for (const s of this.sources) { try { s.stop(); } catch {} }
    this.sources.clear();
    this.nextStartTime = this.ctx.currentTime;
  }
}
```
The browser-side receive loop (play the PCM the server relayed):
```ts
const player = new PcmStreamPlayer();
const ws = new WebSocket(`wss://${location.host}/api/voice`); // WS on your own server
ws.binaryType = "arraybuffer";
ws.onmessage = (e) => {
  if (e.data instanceof ArrayBuffer) player.enqueue(e.data); // audio PCM
  // design control messages (text) to arrive separately as JSON
};
```
Start the `AudioContext` on a user gesture: many browsers block autoplay. Call `new AudioContext()` / `ctx.resume()` in the `onClick` handler of a "Start conversation" button so that audio is enabled from a user-initiated event (both accessibility and browser policy require this).

Always authorize the relay WS (`/api/voice`): this endpoint holds the DashScope API key and is therefore a billing path. Before relaying, verify a session cookie or a short-lived token at the handshake stage. Without authentication, a third party can use TTS on your bill. Keep the key off the browser, and protect the relay itself with authorization.
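As a rough illustration of the "short-lived token" option, the sketch below signs a user ID and an expiry time with HMAC-SHA256 and verifies them at handshake time. The relay in this article runs on Node/Next.js; the example is in Python only to stay consistent with the SDK samples. The token format (`user_id.expiry.signature`) and the function names are hypothetical, not part of any framework.

```python
import hashlib
import hmac
import time
from typing import Optional


def _sign(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(secret: bytes, user_id: str, ttl_s: int = 60,
                now: Optional[float] = None) -> str:
    """Create 'user_id.expiry.signature' (hypothetical format, for illustration)."""
    expires = int((time.time() if now is None else now) + ttl_s)
    payload = f"{user_id}.{expires}"
    return f"{payload}.{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the token is authentic and unexpired, else None."""
    try:
        user_id, expires, sig = token.rsplit(".", 2)
        expiry = int(expires)
    except ValueError:
        return None  # malformed token
    expected = _sign(secret, f"{user_id}.{expires}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):  # constant-time compare
        return None
    if (time.time() if now is None else now) >= expiry:
        return None  # expired
    return user_id
```

The page would obtain such a token from an authenticated HTTP endpoint and present it when opening the WebSocket; the relay rejects the handshake when verification returns None, before any DashScope connection is opened.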
5. Barge-in (interruption): go silent instantly when talked to
The single most important feature that makes a dialogue feel like a dialogue is barge-in. When the user starts talking while the agent is speaking, ① stop generation and ② instantly discard the audio being played.
When VAD (voice activity detection) detects the user's utterance:
- discard the in-progress TTS session on the server side (stop relaying subsequent `response.audio.delta` events),
- flush the playback buffer with the browser's `player.stop()`.
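```ts
// Browser: stop immediately on the signal that the user started speaking
function onUserStartedSpeaking() {
  player.stop(); // discard the audio in play (leave no lingering sound)
  ws.send(JSON.stringify({ type: "barge_in" })); // ask the server to interrupt generation
}
```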
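On the server side, "don't relay subsequent audio" can be sketched with a generation counter: every response gets a number, a barge-in bumps the counter, and chunks tagged with an older number are dropped. This is a minimal sketch of one way to do it; the class name `BargeInGate` and its methods are hypothetical and not part of the SDK.

```python
import threading


class BargeInGate:
    """Drops audio that belongs to a response the user has already interrupted."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0

    def start_response(self) -> int:
        """Call when a new agent response begins; returns its generation number."""
        with self._lock:
            self._generation += 1
            return self._generation

    def barge_in(self) -> None:
        """Call on a {"type": "barge_in"} message: invalidates the current response."""
        with self._lock:
            self._generation += 1

    def should_relay(self, generation: int) -> bool:
        """True only for chunks of the response that is still current."""
        with self._lock:
            return generation == self._generation
```

The relay would capture the number from `start_response()` when it begins speaking and check `should_relay()` before forwarding each decoded chunk, so late deltas from an interrupted response never reach the browser.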
The crux is to give the responsibility for stopping to the playback side. Even if the server-side interruption lands a moment late, the user hears an immediate stop as long as the browser goes silent first. Where latency matters, stopping at the layer closest to the user is the right answer.
6. Resilience, idempotency, observability
WebSockets disconnect, and realtime usage is billed. For production, put the following in place.
- Measure first_audio_delay: get the time to the first audio with the SDK's `get_first_audio_delay()` and emit it to structured logs / metrics. The official documentation notes that the first request is slow because it includes establishing the WS connection, so pooling connections and keeping them warm is the standard approach.
- Automatic reconnection (exponential backoff): reconnect when `on_close` fires. But never play the same utterance twice: assign a unique ID to each utterance and skip IDs already played (idempotency).
- Timeout and fallback: if first_audio_delay exceeds a threshold (e.g., 1.5 seconds), switch to a canned "please wait a moment" audio clip, or fall back to batch TTS (resilience).
- Visualize cost: billing follows character count. Per session, record synthesized characters, estimated cost, first_audio_delay, and interruption count, and correlate them with OpenTelemetry.
- Don't keep PII: keep conversation text (which may contain personal information) out of logs; record metadata only.
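```python
import logging

log = logging.getLogger(__name__)

# At session end, write the measurements to a structured log (never the text itself).
# total_chars and barge_ins are counters maintained by the caller.
log.info("tts_session", extra={
    "session_id": tts.get_session_id(),
    "first_audio_delay_ms": tts.get_first_audio_delay(),
    "chars": total_chars,
    "barge_in_count": barge_ins,
})
```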
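The reconnection and idempotency points from the list above can be sketched with two small helpers: a generator of exponential-backoff delays and a registry of utterance IDs already played. Both names (`backoff_delays`, `PlayedUtterances`) and the default values are hypothetical choices for illustration, not SDK features.

```python
import random


def backoff_delays(base_s: float = 0.5, cap_s: float = 8.0, attempts: int = 5,
                   jitter: float = 0.0, rng=random.random):
    """Yield reconnect delays: base * 2^n, capped, with an optional +/- jitter fraction."""
    for n in range(attempts):
        delay = min(cap_s, base_s * (2 ** n))
        yield delay * (1 + jitter * (2 * rng() - 1))


class PlayedUtterances:
    """Remembers utterance IDs already played so a reconnect never replays them."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, utterance_id: str) -> bool:
        """Return True the first time an ID is seen, False on any repeat."""
        if utterance_id in self._seen:
            return False
        self._seen.add(utterance_id)
        return True


print(list(backoff_delays()))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

A reconnect loop would sleep for each yielded delay between attempts, and after reconnecting would resend only the utterances for which `claim()` returns True. A non-zero `jitter` spreads out reconnects when many sessions drop at the same time.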
7. Conclusion: a design checklist for a low-latency voice agent
- The metric is first_audio_delay: optimize "the first audio," not the total, and make it an SLI.
- Pipeline: when the LLM's first sentence is done, immediately send it to TTS (per sentence, with `commit`).
- 3-layer configuration: browser ⇄ your server ⇄ DashScope. Key on the server, playback in the browser.
- Gapless playback: time-schedule 24kHz PCM with `AudioBufferSourceNode`.
- Barge-in: instant stop at the playback layer closest to the user + generation interruption.
- Resilience: reconnection + utterance-ID idempotency, canned-audio fallback on timeout.
- Observability: correlate first_audio_delay, character count, cost, and interruption count in logs. Don't output PII text.
A voice agent is a matter of designing "the pause" more than "smartness." In my RAG-equipped voice customer-service system, I built in fast responses and natural interruption. If you want to turn your business into a voice dialogue with no waiting, I can work with you end-to-end, from design through implementation and operation. Feel free to get in touch, starting from requirements definition.
References (official documentation)
- Qwen-TTS real-time speech synthesis (Alibaba Cloud Model Studio): WebSocket protocol, SDK, events, modes
- Qwen-TTS speech synthesis (Model Studio): the basics of models, voices, parameters
- MDN: Web Audio API (scheduled playback with `AudioBufferSourceNode`)