# Qwen-TTS / Qwen3-TTS-Flash Production Guide: A Speech-Synthesis Design for Choosing Between the DashScope API and OSS Across 49 Timbres, 10 Languages, Chinese Dialects, and Voice Cloning

> An implementation guide for using Qwen-TTS / Qwen3-TTS at production quality. Explained with real code, faithful to the official docs: the model lineup (qwen3-tts-flash / instruct-flash / realtime / qwen-tts), 49 timbres / 10 languages / 9 Chinese dialects, choosing between the DashScope API (Python / HTTP / streaming) and the OSS version (Apache-2.0 / 3-second voice cloning / voice design), and pricing, idempotency, resilience, observability, and ethics.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Python, Speech Synthesis, Generative AI, Qwen, Architecture Design, Observability
- URL: https://tomodahinata.com/en/blog/qwen-tts-qwen3-tts-flash-production-guide
- Category: Voice AI
- Pillar guide: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents

## Key points

- "Qwen-TTS" refers to two different things: the managed API (DashScope / Model Studio) and the OSS version (GitHub, Apache-2.0). Decide first which one you will use
- The flagship Qwen3-TTS-Flash supports 49+ timbres, 10 languages, and 9 Chinese dialects, and claims SOTA stability for Chinese and English on seed-tts-eval
- The API specifies voice / language_type / stream via MultiModalConversation.call (or HTTP's multimodal-generation/generation). The generated URL expires in 24 hours, so move it to your own storage immediately
- The instruct model can control speaking rate, intonation, and emotion with natural-language instructions, and realtime supports incremental 'speak-while-generating' playback over WebSocket
- The OSS version enables voice cloning from 3 seconds of audio and natural-language voice design. Build into the design the ethical boundary that prohibits replicating a voice without the person's consent

---

"I want to turn text into natural speech" — as a requirement it's one line. But the moment you try to put it into production, the things to decide explode at once. **Do you call a managed API, or run the OSS version yourself? Which timbre (voice) do you choose? Japanese only, or 10-language multilingual narration? Do you return it while speaking in real time, or mass-produce audio files in batch? How do you plug the pitfall that the generated URL disappears in 24 hours? And — how do you hold the line of not replicating someone else's voice without permission?**

This article is an implementation guide for operating Alibaba Cloud's (Tongyi/Qwen) speech-synthesis models **Qwen-TTS / Qwen3-TTS** at **production quality**. I covered the transcription (STT) side's design in the [OpenAI Whisper production guide](/blog/openai-whisper-production-guide-selfhost-vs-api). This article explains the paired **speech-generation (TTS) side** design, weaving in the knowledge I actually gained building a multilingual dubbing pipeline ([the AI video localization platform](/case-studies/ai-video-localization-lipsync)). The map and technology selection for voice AI as a whole (STT / TTS / voice agents) is summarized in the [voice AI production guide](/blog/voice-ai-production-guide-stt-tts-voice-agents).

> **The rule of this article**: Model and API specs are based on **the contents of the Qwen official blog, the Alibaba Cloud Model Studio docs, and GitHub (QwenLM/Qwen3-TTS) as of June 2026**. Because pricing, model names, and supported languages get revised, always check the latest values at the [official documentation](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) before going to production. The code is shaped into a form usable in real operations, but API keys are assumed to be in environment variables (hardcoding strictly forbidden).

---

## 0. The first map: "Qwen-TTS" has two entities

If you start designing without separating the two, you will get stuck later. The term "Qwen-TTS" covers **two delivery forms with different characteristics**.

| | ① Managed API (DashScope / Model Studio) | ② OSS version (GitHub, open-weight) |
| --- | --- | --- |
| Entity | The `dashscope` SDK / HTTP endpoint | `Qwen3-TTS-12Hz-*` (Hugging Face weights) |
| Representative models | `qwen3-tts-flash` / `qwen3-tts-instruct-flash` / `*-realtime` | `1.7B-Base` / `1.7B-CustomVoice` / `1.7B-VoiceDesign` |
| Where it runs | Alibaba Cloud's servers (you send the text) | Your own GPU (text doesn't leave) |
| License | Commercial use follows Model Studio's terms | **Apache-2.0** (you own and can modify the weights) |
| Billing | Per-character usage-based | Fixed cost of compute resources (GPU time) |
| Strengths | Zero ops, same-day adoption, 49+ timbres | Data sovereignty, 3-second voice cloning, unlimited |

**When you say "use Qwen-TTS," first settle which — ① or ② — you mean.** This article covers both and presents the selection criteria (Section 7). For many projects, "first validate value with ①'s API → consider ② once a custom voice or data sovereignty becomes a requirement" is the right order (YAGNI).

---

## 1. What is Qwen3-TTS-Flash (the official numbers)

The current flagship is **Qwen3-TTS-Flash**. The numbers the official docs publish are as follows.

- **49+ timbres**: covering gender, age, regionality, and character.
- **10 languages**: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian.
- **9 Chinese dialects**: Mandarin, Hokkien, Wu (Shanghainese), Cantonese, Sichuanese, Beijing, Nanjing, Tianjin, Shaanxi.
- Trained on **over 5 million hours** of audio data.
- **Benchmarks**: SOTA stability for Chinese and English on the seed-tts-eval test set (claimed to surpass SeedTTS, MiniMax, and GPT-4o-Audio-Preview). On MiniMax's multilingual test set too, claimed to have a lower average WER (word error rate) than MiniMax, ElevenLabs, and GPT-4o-Audio-Preview.

For reference, the first-generation Qwen-TTS was released with 7 timbres (Cherry, Ethan, Chelsie, Serena, Dylan, Jada, Sunny), recording WER 1.209–2.206 and speaker similarity (SIM) 0.473–0.804 on SeedTTS-Eval. If you grasp the Qwen3 generation as a lineage that greatly expanded timbres, languages, dialects, and stability from there, your selection outlook becomes clear.

### 1.1 Model lineage and Regions (this is a pitfall)

Model Studio's model names **differ by Region**. Note that the lineup differs between International (Singapore) and Mainland China (Beijing).

| Model | Role | International (Singapore) | Mainland China (Beijing) |
| --- | --- | --- | --- |
| `qwen3-tts-flash` | Standard fast synthesis (49+ timbres) | ✅ | ✅ |
| `qwen3-tts-instruct-flash` | Instruction-control of speaking style in natural language | ✅ | ✅ |
| `qwen3-tts-flash-realtime` | Incremental synthesis over WebSocket | ✅ | ✅ |
| `qwen3-tts-instruct-flash-realtime` | Realtime with instruction control | ✅ | ✅ |
| `qwen-tts` / `qwen-tts-latest` | First generation (old name) | — | ✅ only |

- **`qwen3-tts-flash` is the stable-version alias**, which at the time of writing corresponds to the snapshot `qwen3-tts-flash-2025-11-27` (you can also specify past versions like `-2025-09-18`). In production, **fixing the snapshot name** avoids timbre behavior changes from alias updates (reproducibility / testability).
- **The endpoint host also differs by Region.** International is `dashscope-intl.aliyuncs.com`, Mainland China is `dashscope.aliyuncs.com`. **The first-generation `qwen-tts` is provided only in the Mainland China Region**, so for requirements that can only use the International Region, choose `qwen3-tts-flash`.
- From the standpoint of **data residency** too, which Region the text is sent to is a design decision. If cross-border transfer of confidential text is a problem, you need to consider Region selection or the OSS version (Section 6).
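
The Region and snapshot rules above can be captured in one small configuration object so they are checked in a single place. This is a minimal sketch: the `TtsTarget` class and the `"intl"` / `"cn"` short names are hypothetical, and the `/api/v1` path on the Mainland host is assumed by symmetry with the International URL used later in this article.

```python
from dataclasses import dataclass

# Hosts come from the text above. The "/api/v1" path on the Mainland host is an
# assumption made by symmetry with the International base URL.
REGION_BASE_URLS = {
    "intl": "https://dashscope-intl.aliyuncs.com/api/v1",
    "cn": "https://dashscope.aliyuncs.com/api/v1",
}

# Models the table above lists as Mainland-only.
CN_ONLY_MODELS = {"qwen-tts", "qwen-tts-latest"}


@dataclass(frozen=True)
class TtsTarget:
    region: str  # "intl" or "cn" (hypothetical short names)
    model: str   # prefer a dated snapshot name in production

    def base_url(self) -> str:
        """Return the base URL, rejecting model/Region pairs that do not exist."""
        if self.region not in REGION_BASE_URLS:
            raise ValueError(f"unknown region: {self.region!r}")
        if self.region != "cn" and self.model in CN_ONLY_MODELS:
            raise ValueError(f"{self.model} is only offered in the Mainland China Region")
        return REGION_BASE_URLS[self.region]


# Pinning the snapshot keeps timbre behavior stable when the alias moves.
PROD = TtsTarget(region="intl", model="qwen3-tts-flash-2025-11-27")
print(PROD.base_url())
```

Failing at configuration time is cheaper than discovering a Region mismatch from an upstream error in production.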

---

## 2. Get it running in 5 minutes: the DashScope API (Python)

First, generate one piece of audio by the shortest path. The SDK is `dashscope`, and the call is `MultiModalConversation.call`.

```bash
pip install -U dashscope
```

```python
import os
import dashscope

# Use the International Region (Singapore). For Mainland China, use the dashscope.aliyuncs.com host instead.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

response = dashscope.MultiModalConversation.call(
    model="qwen3-tts-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # read the key from an environment variable (never hardcode it)
    text="本日はお集まりいただき、ありがとうございます。",
    voice="Cherry",            # timbre (voice); choose from the list in Section 4
    language_type="Japanese",  # match the language of the text (see below)
    stream=False,              # one-shot generation (returned as a URL)
)

# Non-streaming calls return the URL of an audio file (expires 24 hours after generation)
audio_url = response.output.audio.url
print(audio_url)
```

The **biggest pitfall** here is that `response.output.audio.url` is a **temporary URL that expires 24 hours after generation**. In production, save it to your own storage (S3 / Vercel Blob, etc.) immediately after receiving it, then use it in your business logic.

```python
import requests

def fetch_and_store(audio_url: str, dest: str) -> None:
    """The temporary URL disappears after 24 hours, so move the audio to your own storage right after receiving it."""
    res = requests.get(audio_url, timeout=30)
    res.raise_for_status()
    with open(dest, "wb") as f:
        f.write(res.content)  # in production, replace this with an S3 / Blob put
```
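
One detail worth hardening when the destination is a local or shared disk: a download that dies halfway should not leave a truncated file where later code expects finished audio. The sketch below writes to a temporary file and renames it into place. `store_atomically` is a hypothetical helper, not part of any SDK, and it takes any iterable of byte chunks so it can be exercised without a network call.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def store_atomically(chunks: Iterable[bytes], dest: Path) -> int:
    """Write chunks to a temp file next to dest, then rename it into place.

    Returns the number of bytes written. A failure mid-download leaves
    nothing at `dest`, so a later existence check cannot see a partial file.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    written = 0
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        if written == 0:
            raise ValueError("refusing to store an empty audio file")
        os.replace(tmp_name, dest)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return written
```

With `requests`, the chunks could come from `requests.get(audio_url, stream=True, timeout=30).iter_content(chunk_size=65536)`, which also avoids holding the whole file in memory.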

> **A tip on `language_type`**: the official docs "recommend matching it to the text's language." For Japanese text, use `"Japanese"`; for English, `"English"`. Keeping the same timbre (e.g. Cherry) and switching only `language_type` lets you **narrate the translated versions of one script in each of the 10 languages**, which is a strength of Qwen3-TTS (directly relevant to multilingual e-learning / dubbing; Section 10).

---

## 3. The HTTP API: calling it from a TypeScript / Next.js server

If you're using a language without an SDK, or calling from TypeScript / Next.js like this site, hit HTTP directly. **Don't call it from the browser; always execute it server-side (Route Handler / Server Action)** to keep the key secret.

The endpoint (International Region):

```text
POST https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation
```

A Next.js Route Handler example. **Validate the input at the boundary with Zod**, and constrain timbre and language with an allow-list (don't pass user-derived values through raw).

```ts
// app/api/tts/route.ts: runs only on the server (the API key never reaches the browser)
import { z } from "zod";

// Fix the allowed timbres and languages as types (invalid values never reach the API, which prevents cost incidents and exceptions)
const VOICES = ["Cherry", "Ethan", "Serena", "Jennifer", "Ryan"] as const;
const LANGS = ["Japanese", "English", "Chinese", "Korean"] as const;

const Body = z.object({
  text: z.string().min(1).max(2000), // set the cap to fit your requirements (billing is per character)
  voice: z.enum(VOICES),
  language_type: z.enum(LANGS),
});

export async function POST(req: Request) {
  // A malformed JSON body becomes null here and is rejected by the schema as a 400
  const parsed = Body.safeParse(await req.json().catch(() => null));
  if (!parsed.success) {
    return Response.json({ error: parsed.error.flatten() }, { status: 400 });
  }

  const upstream = await fetch(
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DASHSCOPE_API_KEY!}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "qwen3-tts-flash",
        input: parsed.data, // { text, voice, language_type }
      }),
    },
  );

  if (!upstream.ok) {
    // Do not pass the upstream error through; classify and log it here (never leak internal details)
    return Response.json({ error: "tts_upstream_failed" }, { status: 502 });
  }

  const json = (await upstream.json()) as { output: { audio: { url: string } } };
  return Response.json({ url: json.output.audio.url }); // after receiving it, move the file to your own storage
}
```

The request body's shape is `{ "model": ..., "input": { "text", "voice", "language_type" } }`, and the response carries the audio at `output.audio.url`. Add `X-DashScope-SSE: enable` to the headers only when you use streaming output.
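
For a Python backend that prefers plain HTTP over the SDK, the same request shape can be built with the standard library. This is a sketch under the assumptions stated in this section: `build_tts_request` and the constants are hypothetical names, the allow-lists mirror the illustrative ones in the Route Handler, and the function only constructs the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

ENDPOINT = (
    "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/"
    "multimodal-generation/generation"
)
ALLOWED_VOICES = {"Cherry", "Ethan", "Serena", "Jennifer", "Ryan"}
ALLOWED_LANGS = {"Japanese", "English", "Chinese", "Korean"}
MAX_CHARS = 2000  # same illustrative cap as the Zod schema above


def build_tts_request(text: str, voice: str, language_type: str, api_key: str,
                      *, model: str = "qwen3-tts-flash",
                      sse: bool = False) -> urllib.request.Request:
    """Validate at the boundary, then build (but do not send) the POST request."""
    if not 1 <= len(text) <= MAX_CHARS:
        raise ValueError("text length out of range")
    if voice not in ALLOWED_VOICES or language_type not in ALLOWED_LANGS:
        raise ValueError("voice or language_type not in allow-list")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if sse:
        headers["X-DashScope-SSE"] = "enable"  # only for streaming output
    body = {
        "model": model,
        "input": {"text": text, "voice": voice, "language_type": language_type},
    }
    return urllib.request.Request(
        ENDPOINT, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )
```

Sending it is then a matter of `urllib.request.urlopen(req, timeout=30)` and reading `output.audio.url` from the JSON response. Keeping construction separate from sending makes the payload easy to unit-test.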

---

## 4. Choosing and manipulating the voice: timbre, language, dialect, instruction control

### 4.1 How to choose the timbre (voice)

From 49+ timbres, it's practical to **decide on a representative timbre by use case**. Representative examples (summarizing the official descriptions):

| Use case | Representative timbre | Official description (summary) |
| --- | --- | --- |
| General narration (female, bright) | `Cherry` | A young woman, bright and friendly like the sun |
| General narration (male, standard) | `Ethan` | Standard Mandarin with a slight northern accent. Energetic and warm |
| English cinema-quality (female) | `Jennifer` | A premium, movie-quality American-English female |
| Drama / trailers (male) | `Ryan` | Rhythmic, with dramatic intonation |
| Calm academic / explanatory | `Elias` | Academic rigor using storytelling techniques |

All standard timbres **support 10 languages**. That is, you can run an operation that **keeps the timbre fixed and switches only the language** — "Japanese narration in Cherry, the English version of the same script also in Cherry." Being able to deploy multilingually while keeping voice-brand consistency pays off in localization projects.

### 4.2 Producing Chinese dialects (dialects are dedicated timbres)

Dialects are **tied to the timbre side**. To produce the targeted regionality, choose the corresponding timbre.

| Dialect | Representative timbre | Dialect | Representative timbre |
| --- | --- | --- | --- |
| Beijing | `Dylan` | Shanghai (Wu) | `Jada` |
| Sichuan | `Sunny` / `Eric` | Cantonese | `Rocky` / `Kiki` |
| Tianjin | `Peter` | Nanjing | `Li` |
| Shaanxi | `Marcus` | Hokkien (Taiwan) | `Roy` |

For character voices, local commercials, and entertainment uses aimed at the Chinese-speaking sphere, these dialect timbres become a differentiator.
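
Because the dialect is selected through the timbre, a lookup table keeps that decision out of call sites. The mapping below is transcribed from the table above; `voice_for_dialect` and the dictionary keys are hypothetical names chosen for this sketch.

```python
from typing import Dict, Optional, Tuple

# Dialect -> timbres, transcribed from the table in this section.
DIALECT_VOICES: Dict[str, Tuple[str, ...]] = {
    "Beijing": ("Dylan",),
    "Shanghai": ("Jada",),
    "Sichuan": ("Sunny", "Eric"),
    "Cantonese": ("Rocky", "Kiki"),
    "Tianjin": ("Peter",),
    "Nanjing": ("Li",),
    "Shaanxi": ("Marcus",),
    "Hokkien": ("Roy",),
}


def voice_for_dialect(dialect: str, prefer: Optional[str] = None) -> str:
    """Return a timbre for the dialect; `prefer` picks among several options."""
    try:
        options = DIALECT_VOICES[dialect]
    except KeyError:
        raise ValueError(f"no dialect timbre registered for {dialect!r}") from None
    if prefer is not None:
        if prefer not in options:
            raise ValueError(f"{prefer!r} is not a {dialect} timbre")
        return prefer
    return options[0]  # the first listed timbre is the default


print(voice_for_dialect("Sichuan"))
print(voice_for_dialect("Cantonese", prefer="Kiki"))
```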

### 4.3 Controlling speaking style "with words": the instruct model

Rather than touching speed, intonation, and emotion with fine parameters, `qwen3-tts-instruct-flash` lets you **control them with natural-language instructions**.

```python
import os
import dashscope

response = dashscope.MultiModalConversation.call(
    model="qwen3-tts-instruct-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    text="期間限定セール、本日スタートです！",
    voice="Cherry",
    language_type="Japanese",
    # Specify the speaking style in natural language (speed, intonation, emotion)
    instructions="やや速めのテンポで、明るく弾むような上がり調子。ファッション商品の紹介向け。",
    optimize_instructions=True,  # let the model side optimize the instruction text
    stream=False,
)
```

According to the official docs, `instructions` is **up to 1,600 tokens**, and the supported languages are **Chinese and English** (the language of the instruction text itself; operations like reading Japanese body text while writing the instruction in English are also possible). Envisioned uses listed include audiobooks, radio dramas, ads, game-character voiceovers, voice assistants, and documentary narration.

> **Design judgment**: holding the speaking style not as a **code parameter** but as a **prompt (natural language)** lets a director (a non-engineer) directly adjust the direction. This is a design that externalizes "direction = data," so a deploy isn't needed on every revision (ETC: ease of change). On the other hand, for uses that require reproducibility, version-manage a snapshot model + fixed instructions.
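
One way to treat "direction = data" in practice is to keep the presets in a JSON document that a director can edit, and to derive a short version hash so logs and caches can tell revisions apart. This is only a sketch: the JSON shape, `DirectionPreset`, and `load_presets` are hypothetical, and the instruction text is an English rendering of the example above.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DirectionPreset:
    name: str
    model: str         # pin a snapshot name when reproducibility matters
    voice: str
    instructions: str  # natural-language direction, editable by non-engineers

    def version(self) -> str:
        """Short content hash that changes whenever the direction changes."""
        raw = "\x00".join([self.model, self.voice, self.instructions])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def load_presets(raw_json: str) -> Dict[str, DirectionPreset]:
    """Parse a document of the (hypothetical) shape {name: {model, voice, instructions}}."""
    presets = {}
    for name, cfg in json.loads(raw_json).items():
        if not cfg["instructions"].strip():
            raise ValueError(f"preset {name!r} has empty instructions")
        presets[name] = DirectionPreset(name, cfg["model"], cfg["voice"], cfg["instructions"])
    return presets


PRESETS_JSON = """
{"sale_promo": {"model": "qwen3-tts-instruct-flash", "voice": "Cherry",
  "instructions": "Slightly fast tempo, bright and bouncy rising tone. For fashion product promos."}}
"""
presets = load_presets(PRESETS_JSON)
print(presets["sale_promo"].version())
```

Recording `version()` next to each generated file makes it possible to say later which wording of the direction produced which audio.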

---

## 5. Real-time speech synthesis: speak while generating (WebSocket)

In scenes like voice assistants and conversational UIs where "you don't want to make the user wait for the whole text to finish generating," use the `*-realtime` models over **WebSocket** and play incrementally while generating.

- Endpoint (International): `wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime`
- Models: `qwen3-tts-flash-realtime` / `qwen3-tts-instruct-flash-realtime`

The Python SDK receives audio chunks via a callback.

```python
from dashscope.audio.qwen_tts_realtime import QwenTtsRealtime, AudioFormat

# `callback` is a callback object you define beforehand; see the official realtime docs for its interface.
rt = QwenTtsRealtime(
    model="qwen3-tts-flash-realtime",
    callback=callback,  # receives response.audio.delta (base64 audio) incrementally
    url="wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime",
)
rt.connect()
rt.update_session(
    voice="Cherry",
    response_format=AudioFormat.PCM_24000HZ_MONO_16BIT,  # 24 kHz / mono / 16-bit
    mode="server_commit",  # the server decides where to split the text (Section 5.1)
)
# From here on, text you feed in comes back as audio fragments via response.audio.delta events
```
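
What the callback does with each fragment is up to the application. As a sketch for the "save it" case, the helper below decodes base64 fragments and wraps them in a WAV header. It assumes each `response.audio.delta` payload is base64-encoded raw PCM in the 24 kHz / mono / 16-bit format requested above; `deltas_to_wav` is a hypothetical name.

```python
import base64
import wave
from typing import Iterable


def deltas_to_wav(b64_deltas: Iterable[str], path: str, *, sample_rate: int = 24000) -> int:
    """Decode base64 PCM fragments and write one WAV file. Returns the frame count.

    Fragments are joined before writing because a single fragment is not
    guaranteed to end on a 16-bit sample boundary.
    """
    pcm = b"".join(base64.b64decode(d) for d in b64_deltas)
    if len(pcm) % 2:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    with wave.open(path, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)  # 24 kHz by default
        w.writeframes(pcm)
    return len(pcm) // 2
```

For live playback you would hand the decoded bytes to an audio output instead of buffering them, but the decoding step is the same.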

### 5.1 The two modes: `server_commit` and `commit`

Realtime has **two modes that differ in how text is segmented**. Choose by requirement.

- **`server_commit` (server-led)**: the server side intelligently splits the text. Suited for **continuously synthesizing long passages** (reading articles aloud, narration).
- **`commit` (client-led)**: the client manually commits the text buffer to trigger synthesis. When you want **precise control in a conversational scenario** (segmenting per utterance in a chat).

### 5.2 How to think about latency

The SDK can measure `first_audio_delay` (the time from sending the request to receiving the first audio fragment). The official docs explicitly state that **"because the first text send includes establishing the WebSocket connection, the initial first-packet latency includes the connection setup time."** That is, the standard is to **reuse the connection and keep it warm**. Note that the OSS version (Section 6) claims an end-to-end synthesis latency of **97ms**, and this initial response is what determines the perceived quality of a conversation.
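
If you also want an application-level number to compare a cold connection against a warm one, a small timer around your own send and receive points is enough. This sketch is independent of the SDK's own measurement; `FirstAudioTimer` and its method names are hypothetical, and the clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional


class FirstAudioTimer:
    """Measures the delay from sending text to receiving the first audio fragment."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._sent_at: Optional[float] = None
        self.first_audio_delay: Optional[float] = None

    def mark_sent(self) -> None:
        """Call right before the text is sent; resets the previous measurement."""
        self._sent_at = self._clock()
        self.first_audio_delay = None

    def mark_audio(self) -> None:
        """Call on every audio fragment; only the first one after mark_sent() counts."""
        if self._sent_at is not None and self.first_audio_delay is None:
            self.first_audio_delay = self._clock() - self._sent_at
```

Calling `mark_sent()` before the first text send on a fresh connection and again on a reused one shows how much of the initial delay is connection setup.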

> Batch generation of recorded scripts (Section 2) and live synthesis for conversation (this section) are different things. Get the requirement wrong and you will either open WebSockets needlessly or, conversely, create a UX that makes users wait.

The details of the WebSocket protocol, gapless PCM playback in the browser, barge-in (interruption), and pipelining LLM → TTS are dug into in the [Qwen-TTS real-time voice agent implementation guide](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide).

---

## 6. Creating your own voice: the OSS version (voice cloning & voice design)

"I want it read in my (or our talent's) voice," "I can't let data leave," "I want to run it unlimited at a fixed cost" — for those requirements, the **OSS version (QwenLM/Qwen3-TTS)**, released under Apache-2.0, becomes an option.

The main published checkpoints:

| Model | Features |
| --- | --- |
| `Qwen3-TTS-12Hz-1.7B-Base` | Voice cloning from **3 seconds of reference audio** |
| `Qwen3-TTS-12Hz-1.7B-CustomVoice` | 9 preset speakers (Vivian / Serena / Uncle_Fu / Dylan / Eric / Ryan / Aiden / Ono_Anna / Sohee) |
| `Qwen3-TTS-12Hz-1.7B-VoiceDesign` | **Design the voice itself in natural language** |
| `*-0.6B-Base` / `*-0.6B-CustomVoice` | Lightweight version (memory-saving) |

```bash
pip install -U qwen-tts   # using FlashAttention 2 alongside is recommended; assumes a GPU (bf16/fp16)
```

**Voice cloning** (from 3 seconds of reference audio + its transcript, read arbitrary text in the same voice):

```python
import torch
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
)

wavs, sr = model.generate_voice_clone(
    text="この声で、好きな原稿を読み上げます。",
    language="Japanese",
    ref_audio="ref.wav",          # reference audio of about 3 seconds
    ref_text="参照音声の書き起こし", # placeholder: the transcript of the reference audio
)
```

**Voice design** ("ordering" the voice's texture in natural language):

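```python
# Voice design uses the VoiceDesign checkpoint, not the Base model loaded above.
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", device_map="cuda:0", dtype=torch.bfloat16,
)
wavs, sr = design_model.generate_voice_design(
    text="ようこそ、いらっしゃいませ。",
    language="Japanese",
    instruct="落ち着いた中低音の男性。ホテルのコンシェルジュのように丁寧で温かい。",
)
```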

The OSS version's benchmarks (official README): `1.7B-Base` scores **1.24 WER** on SEED test-en, Chinese speaker similarity **0.811**, and **2.356 WER** on long-zh. It supports 10 languages.

> **Ethics and legal are not a "feature" but a "premise"**: voice cloning is a technology that **makes it possible to replicate a voice without the person's consent**. Because there are risks of impersonation, fraud, and defamation, in production it's a mandatory condition to **build into both the design and the contract** "① the written consent of the voice provider," "② explicit, limited use," and "③ disclosure that the output is AI audio." This is a line you can't drop in productionizing any voice AI, not just Qwen-TTS.

The OSS version's self-hosted implementation (a FastAPI inference server) and the governance design covering a consent ledger, use limitation, expiry, disclosure, provenance, and audit logs are detailed in the [Qwen-TTS voice-cloning production implementation guide](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide).
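
To show what "building consent into the design" can look like in code, here is a minimal gate that a cloning endpoint could call before it touches a reference recording. It is a sketch only: `ConsentRecord`, `ConsentError`, and `assert_clone_allowed` are hypothetical names, and a real ledger would also need storage, audit logging, and legal review.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet


@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    written_consent: bool              # condition 1: written consent on file
    allowed_purposes: FrozenSet[str]   # condition 2: explicit, limited use
    expires_at: datetime               # timezone-aware expiry of the agreement


class ConsentError(PermissionError):
    """Raised when a cloning request falls outside the recorded consent."""


def assert_clone_allowed(record: ConsentRecord, purpose: str, now: datetime) -> None:
    """Raise ConsentError unless consent exists, covers this purpose, and is current."""
    if not record.written_consent:
        raise ConsentError("no written consent on file")
    if purpose not in record.allowed_purposes:
        raise ConsentError(f"purpose {purpose!r} is outside the agreed scope")
    if now >= record.expires_at:
        raise ConsentError("consent has expired")


record = ConsentRecord(
    speaker_id="talent-001",
    written_consent=True,
    allowed_purposes=frozenset({"e-learning narration"}),
    expires_at=datetime(2027, 1, 1, tzinfo=timezone.utc),
)
assert_clone_allowed(record, "e-learning narration", datetime(2026, 6, 25, tzinfo=timezone.utc))
```

The gate fails closed: any request that is not explicitly covered raises before synthesis starts. Disclosure of AI-generated output (condition 3) belongs on the delivery side and is not modeled here.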

---

## 7. A selection framework: how to decide API vs. OSS

It's not "which is correct," but choosing the one that fits your requirements along **four axes**.

| Decision axis | OSS version (self-hosted) is favorable | The managed API is favorable |
| --- | --- | --- |
| **Privacy / data sovereignty** | You **can't let text/voice leave** (medical, government, confidential) | External transmission is acceptable |
| **Cost structure** | You want **fixed cost** for high-volume / always-on (GPU amortization) | Low-volume / spiky, where **variable cost** is reasonable |
| **Customizability** | A **custom voice** (clone/design) is a requirement | The 49+ ready-made timbres suffice |
| **Operational structure** | You **can operate** GPU, model updates, and scaling | You don't want to operate anything; you want production-quality output the same day |

### Cost intuition (verify against the official source)

Model Studio's billing is **per-character**. By a rough estimate from public information, `qwen3-tts-flash` is roughly **about $0.013 per 1,000 characters (≈ about $13 per million characters)**, realtime is priced separately, and new users are said to have a free tier (roughly around a million characters). **However, these amounts vary and differ by Region, so always check the latest values on the [official pricing page](https://www.alibabacloud.com/help/en/model-studio/model-pricing).**

The decision pattern is this: **for around tens of thousands to hundreds of thousands of characters per month, the API is overwhelmingly cheaper and zero-ops**. Conversely, if you **generate large volumes of narration constantly every day / need a custom voice / can't let text leave**, a break-even point appears where the fixed cost of running the OSS version on a GPU wins on unit price and requirement fit. "First validate value with the API → consider migrating to OSS once volume and custom requirements solidify" is the right order for many projects (KISS / cost efficiency).
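
The break-even reasoning can be made concrete with a few lines of arithmetic. The sketch below uses the rough per-character figure quoted above purely as a placeholder, and the $500/month GPU bill is a hypothetical input, not a quote. Replace both with current numbers before drawing conclusions.

```python
from decimal import Decimal

# Rough figure quoted in this section; NOT an authoritative price.
USD_PER_1K_CHARS = Decimal("0.013")


def api_cost_usd(chars: int, usd_per_1k: Decimal = USD_PER_1K_CHARS) -> Decimal:
    """Estimated API spend for a given character volume."""
    return usd_per_1k * chars / 1000


def break_even_chars(gpu_monthly_usd: Decimal, usd_per_1k: Decimal = USD_PER_1K_CHARS) -> int:
    """Monthly character volume at which API spend equals a fixed GPU bill."""
    return int(gpu_monthly_usd / usd_per_1k * 1000)


print(api_cost_usd(300_000))             # a few hundred thousand characters per month
print(break_even_chars(Decimal("500")))  # hypothetical $500/month GPU
```

With these placeholder numbers the break-even sits in the tens of millions of characters per month, which is why low and spiky volumes favor the API. The model ignores engineering time, idle GPU capacity, and storage, all of which push the real break-even higher.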

A side-by-side comparison with other TTS (ElevenLabs, OpenAI, Google, Azure) on cost, multilingual support, self-hostability, voice replication, and latency is organized as a requirements-driven decision framework in the [TTS thorough comparison guide](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison).

---

## 8. Production operations design: idempotency, resilience, observability

Speech generation is an external-API job that is billed per use and often runs in large batches. **Call it naively, and a mid-way failure means redoing everything and paying twice.** Here is the minimal toolkit for the generation side, the counterpart to the transcription side ([the Whisper guide](/blog/openai-whisper-production-guide-selfhost-vs-api)).

### 8.1 A content-hash idempotency cache

The same combination of script, timbre, language, and model can be treated as **the same audio no matter how many times you generate it**. So if you **cache keyed on the content hash**, re-runs skip items that were already generated, which makes retries idempotent and reduces billing.

```python
import hashlib
import os
from pathlib import Path

import dashscope

def tts_key(text: str, voice: str, language: str, model: str) -> str:
    """Deterministic key derived from the inputs: same input -> same key -> generation is skipped."""
    h = hashlib.sha256()
    h.update(f"{model}\x00{voice}\x00{language}\x00{text}".encode("utf-8"))
    return h.hexdigest()

def synthesize_idempotent(text: str, voice: str, language: str, model: str, cache_dir: Path) -> Path:
    key = tts_key(text, voice, language, model)
    out = cache_dir / f"{key}.wav"
    if out.exists():
        return out  # re-runs do not call the API (idempotent, saves cost)

    cache_dir.mkdir(parents=True, exist_ok=True)  # make sure the cache directory exists
    resp = dashscope.MultiModalConversation.call(
        model=model, api_key=os.getenv("DASHSCOPE_API_KEY"),
        text=text, voice=voice, language_type=language, stream=False,
    )
    # fetch_and_store is the helper from Section 2; move the file before the 24-hour URL expires
    fetch_and_store(resp.output.audio.url, str(out))
    return out
```
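
One caveat when the instruct model is involved: the key has to cover every input that can change the audio, otherwise a revised direction silently returns the old file. A possible extension, assuming `instructions` and the output format are the only additional inputs in play (`tts_cache_key` is a hypothetical name):

```python
import hashlib
import json


def tts_cache_key(*, model: str, voice: str, language: str, text: str,
                  instructions: str = "", audio_format: str = "wav") -> str:
    """Deterministic key over every input that can change the audio."""
    payload = json.dumps(
        {
            "model": model,
            "voice": voice,
            "language": language,
            "text": text,
            "instructions": instructions,
            "format": audio_format,
        },
        ensure_ascii=False,
        sort_keys=True,          # stable field order -> stable hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing to JSON avoids ambiguity between fields without relying on a separator character. Note also that keying on an alias such as `qwen3-tts-flash` means cached audio survives an alias update, so pinning the snapshot name keeps the cache honest.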

### 8.2 Retry with exponential backoff (limit the targets)

External APIs definitely fail with rate limits and transient outages. Apply retries **only to idempotent operations**, and **don't retry 4xx (invalid input)**.

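```python
import time

TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def _is_transient(exc: Exception) -> bool:
    """Assumption: failures surface as exceptions that carry an HTTP-like `status_code`."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True  # dropped connection / timeout
    return getattr(exc, "status_code", None) in TRANSIENT_STATUS

def with_retry(fn, *, max_attempts: int = 4, base: float = 1.0):
    """Exponential backoff. Retry only transient failures; fail fast on invalid input."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as e:  # in production, classify by the SDK's own exception types
            if not _is_transient(e) or attempt == max_attempts:
                raise
            time.sleep(base * (2 ** (attempt - 1)))  # 1s, 2s, 4s, ...
```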
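
A retry wrapper built on `try` / `except` only helps if failures actually raise. If the client reports errors through a `status_code` field on the returned object instead (an assumption to verify against the SDK version you use), a thin adapter can turn those into exceptions that a classifier such as the `_is_transient` check referenced above can key off. `TtsUpstreamError` and `ensure_ok` are hypothetical names.

```python
from types import SimpleNamespace


class TtsUpstreamError(Exception):
    """Raised when a response reports a non-200 status."""

    def __init__(self, status_code: int, code: str = "", message: str = "") -> None:
        super().__init__(f"{status_code} {code}: {message}".strip())
        self.status_code = status_code
        self.code = code


def ensure_ok(resp):
    """Return resp unchanged on 200; otherwise raise with the status attached."""
    status = getattr(resp, "status_code", None)
    if status != 200:
        raise TtsUpstreamError(
            status or 0,
            getattr(resp, "code", "") or "",
            getattr(resp, "message", "") or "",
        )
    return resp


# Stand-in objects with made-up values; a real call would pass the SDK response.
ok = SimpleNamespace(status_code=200, output="audio")
throttled = SimpleNamespace(status_code=429, code="Throttling", message="rate limit")
```

Wrapping the call as `with_retry(lambda: ensure_ok(call()))` then retries a 429 but lets a 400 fail immediately.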

### 8.3 Observability: what to always record

For a speech-generation job, record **metadata, not the body (the script, which can be PII)**.

- Job ID / input hash / model (snapshot name) / timbre / language
- Character count (the primary driver of billing) and estimated cost
- Processing time, and `first_audio_delay` for realtime
- Failure type (rate limit / timeout / invalid input) and retry count

Emit these as structured logs ([OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs), etc.), and "which generation got stuck" and "whether the cost is reasonable" become traceable at a glance. In projects where **the read-aloud script can contain personal information**, not leaving the body in logs is an absolute condition of internal control.
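
As an illustration of "metadata, not the body," the function below builds a log record from the items listed above and never includes the script itself. The field names are hypothetical, and the cost estimate is only added when the caller supplies a unit price.

```python
import hashlib
import json
import logging
from typing import Optional

logger = logging.getLogger("tts")


def tts_log_record(*, job_id: str, text: str, model: str, voice: str, language: str,
                   elapsed_s: float, outcome: str, retries: int = 0,
                   usd_per_1k_chars: Optional[float] = None) -> dict:
    """Build a structured record that describes the job without storing the script."""
    record = {
        "job_id": job_id,
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "model": model,
        "voice": voice,
        "language": language,
        "char_count": len(text),
        "elapsed_s": round(elapsed_s, 3),
        "outcome": outcome,  # e.g. "ok" / "rate_limit" / "timeout" / "invalid_input"
        "retries": retries,
    }
    if usd_per_1k_chars is not None:
        record["est_cost_usd"] = round(len(text) / 1000 * usd_per_1k_chars, 6)
    return record


rec = tts_log_record(job_id="job-1", text="山田太郎様、ご予約を承りました。", model="qwen3-tts-flash",
                     voice="Cherry", language="Japanese", elapsed_s=1.2345, outcome="ok")
logger.info(json.dumps(rec, ensure_ascii=False))
```

A plain hash of a very short script can be guessed by brute force, so if the scripts themselves are sensitive, a keyed hash (HMAC with a secret) is the safer choice for the input fingerprint.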

---

## 9. UX / accessibility: putting generated audio into a product

Speech synthesis can be **a powerful ally of accessibility**, but if you implement it wrong, it conversely produces WCAG violations and UX degradation. Mandatory items when putting it into a product:

- **Don't autoplay**: sounding audio on its own hinders screen-reader users and also runs afoul of WCAG 1.4.2 (Audio Control). **Always make playback start from a user action** and provide play/pause/volume controls.
- **Always provide a text alternative**: generated audio should have source text. **Present the same text as the read-aloud on screen** so users who can't rely on hearing can also access it (don't make it audio "only").
- **Disclose AI generation**: make explicit that it's synthesized audio (trustworthiness, ethics, future regulatory compliance).
- **Player a11y**: make every operation possible from the keyboard, attach `aria-label` to the controls, and have any waveform animation respect `prefers-reduced-motion`.
- **Perceived speed via caching**: Section 8's idempotency cache, by eliminating regeneration of the same phrase, **makes the perceived experience instant from the second time onward** (two birds with one stone for cost and UX).

Align the front-end implementation norms with the [Next.js × React accessibility (WCAG 2.2) guide](/blog/react-nextjs-web-accessibility-wcag22-guide). The complete implementation of an accessible "article read-aloud" player (server-side generation + a WCAG-compliant React player) is summarized in the [Next.js × Qwen-TTS read-aloud player implementation guide](/blog/nextjs-qwen-tts-accessible-audio-player-text-to-speech).

---

## 10. Recipes by use case (applications)

How to apply the official features in actual projects: four representative patterns.

### 10.1 Multilingual e-learning / batch generation of narration

**Turn the same script into 10 languages with a fixed timbre.** Just looping over `language_type` lets you deploy multilingually while keeping the voice brand.

```python
SCRIPT = {"Japanese": "ようこそ。", "English": "Welcome.", "Korean": "환영합니다."}

for lang, text in SCRIPT.items():
    path = synthesize_idempotent(text, voice="Cherry", language=lang,
                                 model="qwen3-tts-flash", cache_dir=Path("./out"))
    print(lang, path)  # same timbre, multiple languages, generated idempotently
```

### 10.2 Video localization / multilingual dubbing

This is the "voice" part of the pipeline I actually built on the [AI video localization / lip-sync platform](/case-studies/ai-video-localization-lipsync). TTS handles the "multilingual dubbing" in the flow **audio separation → (with Whisper) transcription → translation → (with Qwen-TTS) multilingual dubbing → lip sync**. Qwen3-TTS's 10-language support has the advantage of covering this dubbing step with a single model. For the lip-sync side's design, see the [LatentSync guide](/blog/latentsync-lip-sync-diffusion-model-production-guide).

### 10.3 Voice assistant / IVR (realtime)

Feed the output of the conversation's response generation (LLM) into a `*-realtime` model, synthesize per utterance in `commit` mode, and play incrementally. Combine this with the LLM-side streaming design in the [Vercel AI SDK production guide](/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag), and you can create the perceived experience of **"speaking while thinking."** For RAG-based customer service, [the generative-AI voice chatbot](/case-studies/ai-voice-chatbot) is a closely related case.
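
Per-utterance commits need something that turns a stream of LLM tokens into utterance-sized pieces. The generator below is a deliberately simple sketch: it splits on sentence-ending punctuation and on a length cap. `utterances` and `SENTENCE_END` are hypothetical names, and the punctuation rule is naive (it will also split on the period in "3.5"), so treat it as a starting point.

```python
from typing import Iterable, Iterator

SENTENCE_END = "。！？.!?\n"


def utterances(tokens: Iterable[str], *, max_chars: int = 120) -> Iterator[str]:
    """Group streamed text into utterance-sized pieces for per-utterance commits."""
    buf = ""
    for tok in tokens:
        for ch in tok:
            buf += ch
            # Flush at a sentence boundary, or when the buffer grows too long.
            if ch in SENTENCE_END or len(buf) >= max_chars:
                if buf.strip():
                    yield buf.strip()
                buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever remains when the stream ends


for piece in utterances(["こんにち", "は。今日は", "晴れです", "。ありがとう"]):
    print(piece)
```

Each yielded piece is what you would hand to the realtime client's append and commit calls, so the first sentence starts playing while the LLM is still producing the next one.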

### 10.4 Character voice / advertising

Specify the direction in natural language with `qwen3-tts-instruct-flash` (the instruct example in 4.3), or design **a one-of-a-kind voice** with OSS VoiceDesign. Combine it with the dialect timbres (4.2), and you can also create characters with regionality.

---

## 11. Security, compliance, ethics

- **Keep the API key server-side**: don't call DashScope directly from the browser (key leakage). With Next.js, go via a Route Handler / Server Action, with the key in an environment variable (Section 3).
- **Validate input at the boundary**: a length cap on `text` (billing and DoS countermeasure), and constrain `voice` / `language_type` with an allow-list (enum). Don't pass user-derived values through raw.
- **Data residency**: grasp which Region the text is sent to, and if cross-border transfer is a problem, address it with the OSS version or Region selection (1.1).
- **The lifetime of the generated URL**: `output.audio.url` expires in 24 hours. Move it to your own storage immediately after receiving it, and deliver via a signed URL (Section 2).
- **Consent for voice cloning**: put the person's consent, use limitation, and AI-generation disclosure into both the contract and the implementation (Section 6). This is a line you can't compromise on.

---

## 12. Summary: a selection cheat sheet

Finally, a quick-reference list for when you're unsure.

- **For Japanese/multilingual narration, same-day, for now**: `qwen3-tts-flash` (API). Match `language_type` to the body language, and fix the timbre by use case.
- **You want fine direction (speed, intonation, emotion)**: `qwen3-tts-instruct-flash` + `instructions` (natural language).
- **You don't want to make users wait in conversation / voice assistants**: `qwen3-tts-flash-realtime` (WebSocket, `commit` mode). Keep the connection warm.
- **A custom voice / can't let data leave / unlimited at fixed cost**: the OSS version (Apache-2.0). 3-second cloning or VoiceDesign. Assumes GPU operation and an ethics gate.
- **You need a Chinese dialect**: dialects are on the timbre side. Beijing = Dylan, Shanghai = Jada, Sichuan = Sunny/Eric, Cantonese = Rocky/Kiki, and others.
- **Common equipment for productionizing**: a content-hash idempotency cache, target-limited exponential backoff, immediate move of the 24-hour URL, observability that doesn't emit PII, and a11y (don't autoplay + a text alternative).

Speech synthesis looks like "a one-line requirement," but it's **the work of designing the trade-offs of timbre, language, cost, real-time-ness, privacy, and ethics**. On a multilingual dubbing pipeline, I operated TTS in production as "the voice of localization" and assembled it as a job guaranteeing idempotency, resilience, and observability ([this project ranked #1 in CrowdWorks contracts](/case-studies/ai-video-localization-lipsync)).

**"How do I turn my company's text/video into multilingual audio, and how do I embed it into operations or a product?" — from that design through implementation, operation, and ethics guards, I can accompany you end to end.** Even from the requirements-organizing stage, feel free to consult me.

---

### Reference (official documentation)

- [Qwen-TTS speech synthesis (Alibaba Cloud Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) — model lineup, timbres, `MultiModalConversation` / HTTP API, parameters
- [Qwen-TTS real-time speech synthesis (Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-realtime) — WebSocket, `server_commit` / `commit`, `first_audio_delay`
- [QwenLM/Qwen3-TTS (GitHub, Apache-2.0)](https://github.com/QwenLM/Qwen3-TTS) — OSS-version weights, voice clone/design, benchmarks (97ms / WER / speaker similarity)
- [Qwen3-TTS update (Qwen official blog)](https://qwen.ai/blog?id=qwen3-tts-1128) — the claimed values of 49 timbres / 10 languages / 9 dialects / seed-tts-eval
- [Time to Speak Some Dialects, Qwen-TTS! (first-generation explanation)](https://qwenlm.github.io/blog/qwen-tts/) — the first generation's 7 timbres, dialects, SeedTTS-Eval numbers
- [Model Studio pricing](https://www.alibabacloud.com/help/en/model-studio/model-pricing) — per-character billing, the latest free-tier values (verify before going to production)
