"I want to turn text into natural speech" — as a requirement it's one line. But the moment you try to put it into production, the things to decide explode at once. Do you call a managed API, or run the OSS version yourself? Which timbre (voice) do you choose? Japanese only, or 10-language multilingual narration? Do you return it while speaking in real time, or mass-produce audio files in batch? How do you plug the pitfall that the generated URL disappears in 24 hours? And — how do you hold the line of not replicating someone else's voice without permission?
This article is an implementation guide for operating Alibaba Cloud's (Tongyi/Qwen) speech-synthesis models Qwen-TTS / Qwen3-TTS at production quality. I covered the transcription (STT) side's design in the OpenAI Whisper production guide. This article explains the paired speech-generation (TTS) side design, weaving in the knowledge I actually gained building a multilingual dubbing pipeline (the AI video localization platform). The map and technology selection for voice AI as a whole (STT / TTS / voice agents) is summarized in the voice AI production guide.
The rule of this article: Model and API specs are based on the contents of the Qwen official blog, the Alibaba Cloud Model Studio docs, and GitHub (QwenLM/Qwen3-TTS) as of June 2026. Because pricing, model names, and supported languages get revised, always check the latest values at the official documentation before going to production. The code is shaped into a form usable in real operations, but API keys are assumed to be in environment variables (hardcoding strictly forbidden).
0. The first map: "Qwen-TTS" has two entities
Confuse these two at the design stage and you will get stuck later. The term "Qwen-TTS" covers two delivery forms with different characteristics.
| | ① Managed API (DashScope / Model Studio) | ② OSS version (GitHub, open-weight) |
|---|---|---|
| Entity | The dashscope SDK / HTTP endpoint | Qwen3-TTS-12Hz-* (Hugging Face weights) |
| Representative models | qwen3-tts-flash / qwen3-tts-instruct-flash / *-realtime | 1.7B-Base / 1.7B-CustomVoice / 1.7B-VoiceDesign |
| Where it runs | Alibaba Cloud's servers (you send the text) | Your own GPU (text doesn't leave) |
| License | Commercial use follows Model Studio's terms | Apache-2.0 (you own and can modify the weights) |
| Billing | Per-character usage-based | Fixed cost of compute resources (GPU time) |
| Strengths | Zero ops, same-day adoption, 49+ timbres | Data sovereignty, 3-second voice cloning, unlimited |
When you say "use Qwen-TTS," first settle which — ① or ② — you mean. This article covers both and presents the selection criteria (Section 7). For many projects, "first validate value with ①'s API → consider ② once a custom voice or data sovereignty becomes a requirement" is the right order (YAGNI).
1. What is Qwen3-TTS-Flash (the official numbers)
The current flagship is Qwen3-TTS-Flash. The numbers the official docs publish are as follows.
- 49+ timbres: covering gender, age, regionality, and character.
- 10 languages: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian.
- 9 Chinese dialects: Mandarin, Hokkien, Wu (Shanghainese), Cantonese, Sichuanese, Beijing, Nanjing, Tianjin, Shaanxi.
- Trained on over 5 million hours of audio data.
- Benchmarks: SOTA stability for Chinese and English on the seed-tts-eval test set (claimed to surpass SeedTTS, MiniMax, and GPT-4o-Audio-Preview). On MiniMax's multilingual test set too, claimed to have a lower average WER (word error rate) than MiniMax, ElevenLabs, and GPT-4o-Audio-Preview.
For reference, the first-generation Qwen-TTS was released with 7 timbres (Cherry, Ethan, Chelsie, Serena, Dylan, Jada, Sunny), recording WER 1.209–2.206 and speaker similarity (SIM) 0.473–0.804 on SeedTTS-Eval. If you grasp the Qwen3 generation as a lineage that greatly expanded timbres, languages, dialects, and stability from there, your selection outlook becomes clear.
1.1 Model lineage and Regions (this is a pitfall)
Model Studio's model names differ by Region. Note that the lineup differs between International (Singapore) and Mainland China (Beijing).
| Model | Role | International (Singapore) | Mainland China (Beijing) |
|---|---|---|---|
| qwen3-tts-flash | Standard fast synthesis (49+ timbres) | ✅ | ✅ |
| qwen3-tts-instruct-flash | Instruction-control of speaking style in natural language | ✅ | ✅ |
| qwen3-tts-flash-realtime | Incremental synthesis over WebSocket | ✅ | ✅ |
| qwen3-tts-instruct-flash-realtime | Realtime with instruction control | ✅ | ✅ |
| qwen-tts / qwen-tts-latest | First generation (old name) | — | ✅ only |
- qwen3-tts-flash is the stable-version alias, which at the time of writing corresponds to the snapshot qwen3-tts-flash-2025-11-27 (you can also specify past versions such as -2025-09-18). In production, pinning the snapshot name avoids changes in timbre behavior when the alias is updated (reproducibility / testability).
- The endpoint host also differs by Region. International is dashscope-intl.aliyuncs.com; Mainland China is dashscope.aliyuncs.com. The first-generation qwen-tts is provided only in the Mainland China Region, so if you can only use the International Region, choose qwen3-tts-flash.
- From the standpoint of data residency too, which Region the text is sent to is a design decision. If cross-border transfer of confidential text is a problem, you need to consider Region selection or the OSS version (Section 6).
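The Region and snapshot decisions above are easier to keep consistent when they live in one place. The following is a minimal sketch, not part of the dashscope SDK: `TtsTarget` and `resolve_target` are hypothetical names, the hosts and the snapshot name are the ones quoted in this section, and the `/api/v1` path on the Mainland host is an assumption that mirrors the International URL, so check it against the official docs.

```python
from dataclasses import dataclass

# Hosts as listed in this section; verify against the official docs before use.
REGION_HOSTS = {
    "intl": "dashscope-intl.aliyuncs.com",  # International (Singapore)
    "cn": "dashscope.aliyuncs.com",         # Mainland China (Beijing)
}

# Pin a dated snapshot rather than the moving alias (name taken from the text above).
PINNED_MODEL = "qwen3-tts-flash-2025-11-27"


@dataclass(frozen=True)
class TtsTarget:  # hypothetical helper type, not an SDK class
    region: str
    http_base_url: str
    model: str


def resolve_target(region: str, model: str = PINNED_MODEL) -> TtsTarget:
    """Map a Region key to its endpoint and reject models unavailable there."""
    if region not in REGION_HOSTS:
        raise ValueError(f"unknown region: {region!r}")
    # First-generation qwen-tts is Mainland-only according to the table above.
    if model.startswith("qwen-tts") and region != "cn":
        raise ValueError(f"{model} is only offered in the Mainland China Region")
    return TtsTarget(region, f"https://{REGION_HOSTS[region]}/api/v1", model)
```

The resolved `http_base_url` is the value you would assign to the SDK's base URL setting, and logging `model` alongside each job keeps the pinned snapshot visible when behavior changes.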
2. Get it running in 5 minutes: the DashScope API (Python)
First, generate one piece of audio by the shortest path. The SDK is dashscope, and the call is MultiModalConversation.call.
pip install -U dashscope
import os
import dashscope

# Use the International Region (Singapore). For Mainland China, use the dashscope.aliyuncs.com host instead.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

response = dashscope.MultiModalConversation.call(
    model="qwen3-tts-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # key comes from an environment variable (never hardcode it)
    text="本日はお集まりいただき、ありがとうございます。",
    voice="Cherry",            # timbre (voice); choose from the list in Section 4
    language_type="Japanese",  # match the language of the text (see below)
    stream=False,              # one-shot generation (the result is returned as a URL)
)

# Non-streaming calls return the URL of an audio file (it expires 24 hours after generation)
audio_url = response.output.audio.url
print(audio_url)
The biggest pitfall here is that response.output.audio.url is a temporary URL that expires 24 hours after generation. In production, save it to your own storage (S3 / Vercel Blob, etc.) immediately after receiving it, then use it in your business logic.
import requests

def fetch_and_store(audio_url: str, dest: str) -> None:
    """The temporary URL disappears after 24 hours, so copy the audio to your own storage right after receiving it."""
    res = requests.get(audio_url, timeout=30)
    res.raise_for_status()
    with open(dest, "wb") as f:
        f.write(res.content)  # in production, replace this with an S3/Blob put
A tip on language_type: the official docs "recommend matching it to the text's language." For Japanese text, use "Japanese"; for English, "English". Even with the same timbre (e.g. Cherry), switching only language_type lets you narrate the same script in each of the 10 languages, which is a strength of Qwen3-TTS (directly relevant to multilingual e-learning and dubbing; Section 10).
3. The HTTP API: calling it from a TypeScript / Next.js server
If you're using a language without an SDK, or calling from TypeScript / Next.js like this site, hit HTTP directly. Don't call it from the browser; always execute it server-side (Route Handler / Server Action) to keep the key secret.
The endpoint (International Region):
POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
A Next.js Route Handler example. Validate the input at the boundary with Zod, and constrain timbre and language with an allow-list (don't pass user-derived values through raw).
// app/api/tts/route.ts: runs only on the server (the API key is never exposed to the browser)
import { z } from "zod";

// Fix the allowed timbres and languages in the type system
// (invalid values never reach the API, which prevents cost accidents and exceptions)
const VOICES = ["Cherry", "Ethan", "Serena", "Jennifer", "Ryan"] as const;
const LANGS = ["Japanese", "English", "Chinese", "Korean"] as const;

const Body = z.object({
  text: z.string().min(1).max(2000), // set the cap to your requirements (billing is per character)
  voice: z.enum(VOICES),
  language_type: z.enum(LANGS),
});

export async function POST(req: Request) {
  // A malformed JSON body makes req.json() throw; treat it as invalid input (400)
  const raw = await req.json().catch(() => null);
  const parsed = Body.safeParse(raw);
  if (!parsed.success) {
    return Response.json({ error: parsed.error.flatten() }, { status: 400 });
  }

  const upstream = await fetch(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY!}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "qwen3-tts-flash",
        input: parsed.data, // { text, voice, language_type }
      }),
    },
  );

  if (!upstream.ok) {
    // Don't relay the upstream error as-is; classify and log it here (never leak internal details)
    return Response.json({ error: "tts_upstream_failed" }, { status: 502 });
  }

  const json = (await upstream.json()) as { output: { audio: { url: string } } };
  return Response.json({ url: json.output.audio.url }); // after receiving it, move the audio to your own storage
}
The request body's shape is { "model": ..., "input": { "text", "voice", "language_type" } }, and the response is output.audio.url. Only when using streaming, add X-DashScope-SSE: enable to the header (next section).
4. Choosing and manipulating the voice: timbre, language, dialect, instruction control
4.1 How to choose the timbre (voice)
From 49+ timbres, it's practical to decide on a representative timbre by use case. Representative examples (summarizing the official descriptions):
| Use case | Representative timbre | Official description (summary) |
|---|---|---|
| General narration (female, bright) | Cherry | A young woman, bright and friendly like the sun |
| General narration (male, standard) | Ethan | Standard Mandarin with a slight northern accent. Energetic and warm |
| English cinema-quality (female) | Jennifer | A premium, movie-quality American-English female |
| Drama / trailers (male) | Ryan | Rhythmic, with dramatic intonation |
| Calm academic / explanatory | Elias | Academic rigor using storytelling techniques |
All standard timbres support 10 languages. That is, you can run an operation that keeps the timbre fixed and switches only the language — "Japanese narration in Cherry, the English version of the same script also in Cherry." Being able to deploy multilingually while keeping voice-brand consistency pays off in localization projects.
4.2 Producing Chinese dialects (dialects are dedicated timbres)
Dialects are tied to the timbre side. To produce the targeted regionality, choose the corresponding timbre.
| Dialect | Representative timbre | Dialect | Representative timbre |
|---|---|---|---|
| Beijing | Dylan | Shanghai (Wu) | Jada |
| Sichuan | Sunny / Eric | Cantonese | Rocky / Kiki |
| Tianjin | Peter | Nanjing | Li |
| Shaanxi | Marcus | Hokkien (Taiwan) | Roy |
For character voices, local commercials, and entertainment uses aimed at the Chinese-speaking sphere, these dialect timbres become a differentiator.
4.3 Controlling speaking style "with words": the instruct model
Rather than touching speed, intonation, and emotion with fine parameters, qwen3-tts-instruct-flash lets you control them with natural-language instructions.
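response = dashscope.MultiModalConversation.call(
    model="qwen3-tts-instruct-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text="期間限定セール、本日スタートです!",
    voice="Cherry",
    language_type="Japanese",
    # Describe the speaking style in natural language (speed, intonation, emotion).
    # Written in English here because the instruction text supports Chinese and English.
    instructions="Slightly fast tempo, bright and bouncy with a rising intonation. Suited to introducing fashion products.",
    optimize_instructions=True,  # let the model side optimize the instruction text
    stream=False,
)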
According to the official docs, instructions is up to 1,600 tokens, and the supported languages are Chinese and English (the language of the instruction text itself; operations like reading Japanese body text while writing the instruction in English are also possible). Envisioned uses listed include audiobooks, radio dramas, ads, game-character voiceovers, voice assistants, and documentary narration.
Design judgment: holding the speaking style not as a code parameter but as a prompt (natural language) lets a director (a non-engineer) directly adjust the direction. This is a design that externalizes "direction = data," so a deploy isn't needed on every revision (ETC: ease of change). On the other hand, for uses that require reproducibility, version-manage a snapshot model + fixed instructions.
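One way to treat direction as data is to keep named presets outside the code and resolve them at call time. This is a minimal sketch under stated assumptions: the preset schema, `load_direction`, and `direction_tag` are hypothetical, and the JSON is inlined only to keep the example self-contained (in practice it would sit in a CMS or a version-controlled file).

```python
import json

# Hypothetical preset content that a director can edit without a deploy.
PRESETS_JSON = """
{
  "sale_announcement": {
    "version": 3,
    "model": "qwen3-tts-instruct-flash",
    "voice": "Cherry",
    "instructions": "Slightly fast tempo, bright and bouncy rising intonation."
  }
}
"""


def load_direction(name: str, raw: str = PRESETS_JSON) -> dict:
    """Return the call keyword arguments for one named direction preset."""
    preset = json.loads(raw)[name]
    return {
        "model": preset["model"],
        "voice": preset["voice"],
        "instructions": preset["instructions"],
    }


def direction_tag(name: str, raw: str = PRESETS_JSON) -> str:
    """Name plus version, suitable for logs and cache keys (reproducibility)."""
    return f"{name}@v{json.loads(raw)[name]['version']}"
```

The returned dictionary can be unpacked into the call shown above together with `text` and `language_type`, and recording the tag per job shows which revision of the direction produced a given audio file.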
5. Real-time speech synthesis: speak while generating (WebSocket)
In scenes like voice assistants and conversational UIs where "you don't want to make the user wait for the whole text to finish generating," use the *-realtime models over WebSocket and play incrementally while generating.
- Endpoint (International): wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime
- Models: qwen3-tts-flash-realtime / qwen3-tts-instruct-flash-realtime
The Python SDK receives audio chunks via a callback.
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, AudioFormat

# `callback` is assumed to be an instance of your own callback class, defined elsewhere
rt = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,  # receives response.audio.delta events (base64 audio) incrementally
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
rt.connect()
rt.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,  # 24 kHz / mono / 16-bit
    mode="server_commit",  # the server segments the text intelligently (see below)
)
# From here on, text you feed in comes back as audio fragments via response.audio.delta events
5.1 The two modes: server_commit and commit
Realtime has two modes that differ in how text is segmented. Choose by requirement.
- server_commit (server-led): the server splits the text intelligently. Suited to continuously synthesizing long passages (reading articles aloud, narration).
- commit (client-led): the client manually commits the text buffer to trigger synthesis. Use it when you want precise control in a conversational scenario (segmenting per utterance in a chat).
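In commit mode the client decides where an utterance ends, so something has to cut the incoming text stream into utterances. The sketch below shows only that client-side step and omits the SDK calls that append and commit text. `split_utterances` is a hypothetical helper, and splitting on sentence-ending punctuation is a deliberately naive rule (it would also split a decimal such as 3.14).

```python
SENTENCE_END = "。！？.!?"


def split_utterances(chunks):
    """Yield complete sentences from an iterable of streamed text chunks."""
    buf = ""
    for chunk in chunks:
        for ch in chunk:
            buf += ch
            if ch in SENTENCE_END:
                if buf.strip():
                    yield buf.strip()
                buf = ""
    if buf.strip():
        yield buf.strip()  # flush the trailing partial sentence


# Chunk boundaries from an LLM stream rarely align with sentence boundaries.
for utterance in split_utterances(["Hello. How", " are you? Fi", "ne"]):
    print(utterance)  # each item is one unit to commit for synthesis
```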
5.2 How to think about latency
The SDK can measure first_audio_delay (the time from sending the request to receiving the first audio fragment). The official docs explicitly state that "because the first text send includes establishing the WebSocket connection, the initial first-packet latency includes the connection setup time." That is, the standard is to reuse the connection and keep it warm. Note that the OSS version (Section 6) claims an end-to-end synthesis latency of 97ms, and this initial response is what determines the perceived quality of a conversation.
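To compare a cold connection against a warm one, it can help to take the same measurement on the client. This is a minimal sketch that does not depend on the SDK: `FirstAudioTimer` is a hypothetical helper, and you would call `mark_sent` just before sending text and `mark_chunk` from whatever handler receives audio fragments.

```python
import time
from typing import Optional


class FirstAudioTimer:  # hypothetical helper, independent of the SDK
    """Records the delay between sending text and the first audio chunk."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._sent_at: Optional[float] = None
        self.first_audio_delay: Optional[float] = None

    def mark_sent(self) -> None:
        """Call right before sending text; resets the measurement."""
        self._sent_at = self._clock()
        self.first_audio_delay = None

    def mark_chunk(self) -> None:
        """Call on every audio chunk; only the first one is recorded."""
        if self._sent_at is not None and self.first_audio_delay is None:
            self.first_audio_delay = self._clock() - self._sent_at
```

Measuring the first request on a fresh connection and a later request on the same connection separately makes the connection-setup share of the latency visible.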
Recorded text's "batch generation" (Section 2) and a conversation's "live synthesis" (this section) are different things. Mistake the requirement and you'll either open WebSockets needlessly or, conversely, create a UX that makes users wait.
The details of the WebSocket protocol, gapless PCM playback in the browser, barge-in (interruption), and pipelining LLM → TTS are dug into in the Qwen-TTS real-time voice agent implementation guide.
6. Creating your own voice: the OSS version (voice cloning & voice design)
"I want it read in my (or our talent's) voice," "I can't let data leave," "I want to run it unlimited at a fixed cost" — for those requirements, the OSS version (QwenLM/Qwen3-TTS), released under Apache-2.0, becomes an option.
The main published checkpoints:
| Model | Features |
|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | Voice cloning from 3 seconds of reference audio |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 9 preset speakers (Vivian / Serena / Uncle_Fu / Dylan / Eric / Ryan / Aiden / Ono_Anna / Sohee) |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Design the voice itself in natural language |
| *-0.6B-Base / *-0.6B-CustomVoice | Lightweight version (memory-saving) |
pip install -U qwen-tts # using FlashAttention 2 alongside it is recommended; assumes a GPU (bf16/fp16)
Voice cloning (from 3 seconds of reference audio + its transcript, read arbitrary text in the same voice):
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_clone(
    text="この声で、好きな原稿を読み上げます。",
    language="Japanese",
    ref_audio="ref.wav",                 # reference audio of about 3 seconds
    ref_text="参照音声の書き起こし",      # transcript matching the reference audio
)
Voice design ("ordering" the voice's texture in natural language):
# Assumes `model` was loaded from the VoiceDesign checkpoint in the table above, not from Base
wavs, sr = model.generate_voice_design(
    text="ようこそ、いらっしゃいませ。",
    language="Japanese",
    instruct="落ち着いた中低音の男性。ホテルのコンシェルジュのように丁寧で温かい。",
)
The OSS version's benchmarks (official README): 1.7B-Base scores 1.24 WER on SEED test-en, Chinese speaker similarity 0.811, and 2.356 WER on long-zh. It supports 10 languages.
Ethics and legal are not a "feature" but a "premise": voice cloning is a technology that makes it possible to replicate a voice without the person's consent. Because there are risks of impersonation, fraud, and defamation, in production it's a mandatory condition to build into both the design and the contract "① the written consent of the voice provider," "② explicit, limited use," and "③ disclosure that the output is AI audio." This is a line you can't drop in productionizing any voice AI, not just Qwen-TTS.
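The three conditions can be enforced in code as a gate in front of every cloning call, not only in the contract. The sketch below is one possible shape under stated assumptions: `VoiceConsent`, its field names, and `assert_clone_allowed` are hypothetical, and a real ledger would live in a database with audit logging.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VoiceConsent:  # hypothetical record; field names are illustrative
    speaker_id: str
    signed_document_ref: str   # pointer to the written consent (condition ①)
    allowed_uses: frozenset    # explicit, limited uses (condition ②)
    expires_on: date
    disclose_ai_audio: bool = True  # output is labeled as AI audio (condition ③)


def assert_clone_allowed(consent: VoiceConsent, use: str, today: date) -> None:
    """Raise PermissionError unless this use is covered by a valid consent."""
    if not consent.signed_document_ref:
        raise PermissionError("no written consent on file")
    if today > consent.expires_on:
        raise PermissionError("consent has expired")
    if use not in consent.allowed_uses:
        raise PermissionError(f"use {use!r} is outside the agreed scope")
    if not consent.disclose_ai_audio:
        raise PermissionError("output must be disclosed as AI audio")
```

Calling the gate immediately before `generate_voice_clone` means a missing or expired consent stops the job before any audio exists.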
The OSS version's self-hosted implementation (a FastAPI inference server) and the governance design covering a consent ledger, use limitation, expiry, disclosure, provenance, and audit logs are detailed in the Qwen-TTS voice-cloning production implementation guide.
7. A selection framework: how to decide API vs. OSS
It's not "which is correct," but choosing the one that fits your requirements along four axes.
| Decision axis | OSS version (self-hosted) is favorable | The managed API is favorable |
|---|---|---|
| Privacy / data sovereignty | You can't let text/voice leave (medical, government, confidential) | External transmission is acceptable |
| Cost structure | You want fixed cost for high-volume / always-on (GPU amortization) | Low-volume / spiky, where variable cost is reasonable |
| Customizability | A custom voice (clone/design) is a requirement | The 49+ ready-made timbres suffice |
| Operational structure | You can operate GPU, model updates, and scaling | You don't want to operate; you want accuracy same-day |
Cost intuition (verify against the official source)
Model Studio's billing is per-character. By a rough estimate from public information, qwen3-tts-flash is roughly about $0.013 per 1,000 characters (≈ about $13 per million characters), realtime is priced separately, and new users are said to have a free tier (roughly around a million characters). However, these amounts vary and differ by Region, so always check the latest values on the official pricing page.
The decision pattern is this: for around tens of thousands to hundreds of thousands of characters per month, the API is overwhelmingly cheaper and zero-ops. Conversely, if you generate large volumes of narration constantly every day / need a custom voice / can't let text leave, a break-even point appears where the fixed cost of running the OSS version on a GPU wins on unit price and requirement fit. "First validate value with the API → consider migrating to OSS once volume and custom requirements solidify" is the right order for many projects (KISS / cost efficiency).
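The break-even point can be estimated with simple arithmetic. This sketch uses the rough per-character figure quoted above; the GPU monthly cost passed in is purely an assumption for illustration, and the calculation ignores the free tier, realtime pricing, and the labor cost of operating a GPU.

```python
API_USD_PER_1K_CHARS = 0.013  # rough figure quoted above; verify on the pricing page


def api_monthly_cost(chars_per_month: int) -> float:
    """Per-character API cost for one month, in USD."""
    return chars_per_month / 1000 * API_USD_PER_1K_CHARS


def break_even_chars(gpu_monthly_usd: float) -> int:
    """Monthly character volume at which API cost equals the fixed GPU cost."""
    return round(gpu_monthly_usd / API_USD_PER_1K_CHARS * 1000)


# 300,000 characters per month on the API
print(api_monthly_cost(300_000))
# Break-even volume for a hypothetical GPU setup costing $650 per month
print(break_even_chars(650.0))
```

With these inputs the API costs roughly $3.90 for 300,000 characters, and a $650-per-month GPU only pays off at around 50 million characters per month, which is why cost alone rarely justifies self-hosting at low volume.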
A side-by-side comparison with other TTS (ElevenLabs, OpenAI, Google, Azure) on cost, multilingual support, self-hostability, voice replication, and latency is organized as a requirements-driven decision framework in the TTS thorough comparison guide.
8. Production operations design: idempotency, resilience, observability
Speech generation is a job of "external API, with billing, large batch." Call it naively, and a mid-way failure means redoing everything and double billing. Here's the minimal equipment for the generation side, paired with the transcription side (the Whisper guide).
8.1 A content-hash idempotency cache
The same combination of script, timbre, language, and model is the same audio no matter how many times you generate it. So if you cache keyed on the content hash, you can skip already-generated items on re-run, making retries idempotent and reducing billing.
import hashlib
import os
from pathlib import Path

import dashscope

def tts_key(text: str, voice: str, language: str, model: str) -> str:
    """Deterministic key derived from the inputs. Same input, same key, so generation can be skipped (idempotent, cheaper)."""
    h = hashlib.sha256()
    h.update(f"{model}\x00{voice}\x00{language}\x00{text}".encode("utf-8"))
    return h.hexdigest()

def synthesize_idempotent(text: str, voice: str, language: str, model: str, cache_dir: Path) -> Path:
    key = tts_key(text, voice, language, model)
    out = cache_dir / f"{key}.wav"
    if out.exists():
        return out  # a re-run does not call the API (idempotent, lower cost)
    cache_dir.mkdir(parents=True, exist_ok=True)  # make sure the cache directory exists before writing
    resp = dashscope.MultiModalConversation.call(
        model=model, api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text, voice=voice, language_type=language, stream=False,
    )
    # fetch_and_store is the helper from Section 2; save the file before the 24-hour URL expires
    fetch_and_store(resp.output.audio.url, str(out))
    return out
8.2 Retry with exponential backoff (limit the targets)
External APIs definitely fail with rate limits and transient outages. Apply retries only to idempotent operations, and don't retry 4xx (invalid input).
import time

def with_retry(fn, *, max_attempts: int = 4, base: float = 1.0):
    """Exponential backoff. Retry only transient failures; let invalid input fail immediately (fail fast)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:  # in production, classify by dashscope's exception types
            transient = _is_transient(e)  # 429 / 5xx / dropped connection -> True
            if not transient or attempt == max_attempts:
                raise
            time.sleep(base * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
8.3 Observability: what to always record
For a speech-generation job, record metadata, not the body (the script, which can be PII).
- Job ID / input hash / model (snapshot name) / timbre / language
- Character count (= the primary cause of billing) and estimated cost
- Processing time, and first_audio_delay for realtime
- Failure type (rate limit / timeout / invalid input) and retry count
Emit these as structured logs (OpenTelemetry, etc.), and "which generation got stuck" and "whether the cost is reasonable" become traceable at a glance. In projects where the read-aloud script can contain personal information, not leaving the body in logs is an absolute condition of internal control.
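As an illustration of "metadata, not the body," the record below carries the fields listed above and deliberately has no field for the script. `job_log_record` and its field names are hypothetical, and plain `logging` with JSON stands in for whichever structured-logging stack you use.

```python
import hashlib
import json
import logging
from typing import Optional

logger = logging.getLogger("tts")


def job_log_record(job_id: str, text: str, model: str, voice: str,
                   language: str, elapsed_s: float, retries: int,
                   failure: Optional[str] = None) -> dict:
    """Build a log record that carries metadata only, never the script itself."""
    return {
        "job_id": job_id,
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "model": model,          # log the snapshot name, not just the alias
        "voice": voice,
        "language": language,
        "char_count": len(text),  # main billing driver
        "elapsed_s": round(elapsed_s, 3),
        "retries": retries,
        "failure": failure,      # e.g. "rate_limit", "timeout", "invalid_input"
    }


def log_job(record: dict) -> None:
    """Emit the record as one JSON line."""
    logger.info(json.dumps(record, ensure_ascii=False))
```

Because the function never copies `text` into the record, the script cannot leak into logs through this path.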
9. UX / accessibility: putting generated audio into a product
Speech synthesis can be a powerful ally of accessibility, but if you implement it wrong, it conversely produces WCAG violations and UX degradation. Mandatory items when putting it into a product:
- Don't autoplay: sounding audio on its own hinders screen-reader users and also runs afoul of WCAG 1.4.2 (Audio Control). Always make playback start from a user action and provide play/pause/volume controls.
- Always provide a text alternative: generated audio should have source text. Present the same text as the read-aloud on screen so users who can't rely on hearing can also access it (don't make it audio "only").
- Disclose AI generation: make explicit that it's synthesized audio (trustworthiness, ethics, future regulatory compliance).
- Player a11y: make every operation possible from the keyboard, attach aria-label, and have the waveform animation respect prefers-reduced-motion.
- Perceived speed via caching: Section 8's idempotency cache eliminates regeneration of the same phrase, so playback feels instant from the second request onward (a win for both cost and UX).
Align the front-end implementation norms with the Next.js × React accessibility (WCAG 2.2) guide. The complete implementation of an accessible "article read-aloud" player (server-side generation + a WCAG-compliant React player) is summarized in the Next.js × Qwen-TTS read-aloud player implementation guide.
10. Recipes by use case (applications)
How to land the official features into actual projects. Four representative patterns.
10.1 Multilingual e-learning / batch generation of narration
Turn the same script into 10 languages with a fixed timbre. Just looping over language_type lets you deploy multilingually while keeping the voice brand.
SCRIPT = {"Japanese": "ようこそ。", "English": "Welcome.", "Korean": "환영합니다."}

for lang, text in SCRIPT.items():
    path = synthesize_idempotent(text, voice="Cherry", language=lang,
                                 model="qwen3-tts-flash", cache_dir=Path("./out"))
    print(lang, path)  # same timbre, multiple languages, generated idempotently
10.2 Video localization / multilingual dubbing
This is the "voice" part of the pipeline I actually built on the AI video localization / lip-sync platform. TTS handles the "multilingual dubbing" in the flow audio separation → (with Whisper) transcription → translation → (with Qwen-TTS) multilingual dubbing → lip sync. Qwen3-TTS's 10-language support has the advantage of covering this dubbing step with a single model. For the lip-sync side's design, see the LatentSync guide.
10.3 Voice assistant / IVR (realtime)
Feed the output of the conversation's response generation (LLM) into a *-realtime model, synthesize per utterance in commit mode, and play the audio incrementally. Combine this with the LLM-side streaming design in the Vercel AI SDK production guide, and you can create the perceived experience of "speaking while thinking." For RAG-based customer service, the generative-AI voice chatbot is a closely related case.
10.4 Character voice / advertising
Specify the direction in natural language with qwen3-tts-instruct-flash (the instruct example in 4.3), or design a one-of-a-kind voice with OSS VoiceDesign. Combine it with the dialect timbres (4.2), and you can also create characters with regionality.
11. Security, compliance, ethics
- Keep the API key server-side: don't call DashScope directly from the browser (key leakage). With Next.js, go via a Route Handler / Server Action, with the key in an environment variable (Section 3).
- Validate input at the boundary: put a length cap on text (billing and DoS countermeasure), and constrain voice / language_type with an allow-list (enum). Don't pass user-derived values through raw.
- Data residency: grasp which Region the text is sent to, and if cross-border transfer is a problem, address it with the OSS version or Region selection (1.1).
- The lifetime of the generated URL: output.audio.url expires in 24 hours. Move it to your own storage immediately after receiving it, and deliver via a signed URL (Section 2).
- Consent for voice cloning: put the person's consent, use limitation, and AI-generation disclosure into both the contract and the implementation (Section 6). This is a line you can't compromise on.
12. Summary: a selection cheat sheet
Finally, a quick-reference list for when you're unsure.
- For Japanese/multilingual narration, same-day, for now: qwen3-tts-flash (API). Match language_type to the body language, and fix the timbre by use case.
- You want fine direction (speed, intonation, emotion): qwen3-tts-instruct-flash + instructions (natural language).
- You don't want to make users wait in conversation / voice assistants: qwen3-tts-flash-realtime (WebSocket, commit mode). Keep the connection warm.
- A custom voice / can't let data leave / unlimited at fixed cost: the OSS version (Apache-2.0). 3-second cloning or VoiceDesign. Assumes GPU operation and an ethics gate.
- You need a Chinese dialect: dialects are on the timbre side. Beijing = Dylan, Shanghai = Jada, Sichuan = Sunny/Eric, Cantonese = Rocky/Kiki, and others.
- Common equipment for productionizing: a content-hash idempotency cache, target-limited exponential backoff, immediate move of the 24-hour URL, observability that doesn't emit PII, and a11y (don't autoplay + a text alternative).
Speech synthesis looks like "a one-line requirement," but it's the work of designing the trade-offs of timbre, language, cost, real-time-ness, privacy, and ethics. On a multilingual dubbing pipeline, I operated TTS in production as "the voice of localization" and assembled it as a job guaranteeing idempotency, resilience, and observability (this project ranked #1 in CrowdWorks contracts).
"How do I turn my company's text/video into multilingual audio, and how do I embed it into operations or a product?" — from that design through implementation, operation, and ethics guards, I can accompany you end to end. Even from the requirements-organizing stage, feel free to consult me.
Reference (official documentation)
- Qwen-TTS speech synthesis (Alibaba Cloud Model Studio): model lineup, timbres, MultiModalConversation / HTTP API, parameters
- Qwen-TTS real-time speech synthesis (Model Studio): WebSocket, server_commit / commit, first_audio_delay
- QwenLM/Qwen3-TTS (GitHub, Apache-2.0): OSS-version weights, voice clone/design, benchmarks (97ms / WER / speaker similarity)
- Qwen3-TTS update (Qwen official blog): the claimed values of 49 timbres / 10 languages / 9 dialects / seed-tts-eval
- Time to Speak Some Dialects, Qwen-TTS! (first-generation explanation): the first generation's 7 timbres, dialects, SeedTTS-Eval numbers
- Model Studio pricing: per-character billing, the latest free-tier values (verify before going to production)