# TTS in-depth comparison 2026: choosing among Qwen-TTS / ElevenLabs / OpenAI / Google / Azure by cost, multilingual reach, self-hosting, voice cloning, and latency

> A selection guide for speech-synthesis (TTS) APIs/models. It compares Qwen3-TTS-Flash, ElevenLabs Flash v2.5, OpenAI gpt-4o-mini-tts, Google Chirp 3 HD, and Azure Neural/Custom Neural Voice across six axes — pricing (per-character vs per-token), supported languages, self-hostability, voice cloning, latency, and data residency — grounded in official sources. It presents a decision framework that works backward from your requirements.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Speech synthesis, Generative AI, Qwen, Architecture design, Cost efficiency, Technology selection
- URL: https://tomodahinata.com/en/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison
- Category: Voice AI
- Pillar guide: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents

## Key points

- TTS selection isn't 'which is the most powerful' but 'which fits the requirements.' Decide on six axes: cost, multilingual reach, self-hosting, voice cloning, latency, and data residency.
- Of the five compared here, only Qwen3-TTS is open-weight (Apache-2.0) and self-hostable. If data sovereignty, no usage limits, and fixed cost are hard requirements, it is the only candidate.
- Pricing units differ (characters vs tokens vs minutes). Roughly, Qwen is among the cheapest and ElevenLabs sits at the higher end, but the ranking can flip depending on the workload.
- Chinese and dialects → Qwen; ultra-multilingual → Google/Azure; ease of voice cloning → ElevenLabs; OpenAI-ecosystem integration → gpt-4o-mini-tts.
- If you prioritize a voice-cloning 'consent gate,' Azure Custom Neural Voice (approval-based) and OSS self-hosting are the easiest to design around.

---

"So which TTS should I use, in the end?" — this is one of the most common questions on projects. And the only honest answer is **"it depends on the requirements."** Speech synthesis differs greatly product to product in pricing model, the languages it's good at, whether it can clone a voice, and whether you can run it on your own servers.

This article lines up the major TTS options **grounded in official sources** and presents **a framework for choosing by working backward from requirements**. It draws on my own experience selecting and implementing TTS for [the AI video-localization platform](/case-studies/ai-video-localization-lipsync). For Qwen-TTS details, see the [Qwen-TTS production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide). The "API usage vs self-hosting" decision, seen from the cost and data-sovereignty angle, is covered in [TTS self-hosting vs ElevenLabs (cost, data sovereignty)](/blog/tts-self-hosting-vs-elevenlabs-cost-data-sovereignty-guide).

> **Rules for this article**: specs and pricing are **rough estimates** based on **each vendor's official documentation (as of June 2026)**. **TTS pricing and models are revised quickly, and units (characters / tokens / minutes) are all over the place.** For an actual estimate, always compute on each vendor's pricing page against your own workload (language, character count, concurrency). This article provides "order-of-magnitude feel" and "selection axes."

---

## 0. First, fix the six selection axes

TTS selection comes down to the following six axes. Before ranking products, decide **which axes your own requirements lean toward**.

1. **Cost structure**: per-character, per-token, or per-minute. The optimum shifts with volume and spikes.
2. **Multilingual / dialects**: can it produce the languages you need at native quality? What about Chinese dialects?
3. **Self-hostability**: can you own the weights and run them in-house (= data sovereignty, no limits, fixed cost)?
4. **Voice cloning**: do you need it? If so, can you design a consent gate?
5. **Latency**: conversational use (speed of first audio) or batch (speed is secondary)?
6. **Data residency**: which country/operator do scripts and voices go to? Are there regulatory or confidentiality constraints?

---

## 1. At-a-glance comparison (rough estimates, grounded in official sources)

| | **Qwen3-TTS** | **ElevenLabs** | **OpenAI** | **Google** | **Azure** |
| --- | --- | --- | --- | --- | --- |
| Representative model | Qwen3-TTS-Flash | Flash v2.5 | gpt-4o-mini-tts | Chirp 3 HD | Neural / Custom Neural Voice |
| Self-host (OSS) | ✅ **Apache-2.0** | ❌ | ❌ | ❌ | ❌ |
| Pricing unit | Characters | Characters (credits) | **Tokens** | Characters | Characters |
| Pricing order of magnitude | ≈ $13 / 1M chars | ≈ $50 / 1M chars | ≈ $15 / 1M chars equivalent | $30 / 1M chars | $22 / 1M chars (Neural HD) |
| Supported languages | 10 | 32 | 50+ | 75+ | 100+ |
| Chinese dialects | ✅ **9 dialects** | △ | △ | △ | △ |
| Voice cloning | ✅ (API/OSS) | ✅ (instant + pro) | ❌ (not offered) | ✅ (instant custom) | ✅ (**approval-based**) |
| Low-latency guideline | 97ms (OSS stated) | **~75ms** (Flash) | first audio ~300–600ms | streaming supported | real-time supported |
| Instruction control (delivery) | ✅ instruct | ✅ (v3 tags, etc.) | ✅ instructions | △ | ✅ (style/SSML) |

> All numbers are **rough estimates** from official/public information as of June 2026. Because the pricing units and assumptions differ, read the side-by-side amounts as "order-of-magnitude feel" (a stricter comparison is in the next section).

---

## 2. Pricing: the units differ, so compare by order of magnitude

The most common source of confusion when comparing TTS pricing is **the difference in units**.

- **Per-character** (Qwen, Google, Azure, ElevenLabs credits): `character count × unit price`. Easy to estimate.
- **Per-token** (OpenAI `gpt-4o-mini-tts`): input text $0.60/1M tokens + **audio output $12/1M tokens**. Since one Japanese character can become multiple tokens, **the cost is hard to predict without measuring** (a commonly cited rough figure is about $0.015/min equivalent).
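
To make the per-token arithmetic concrete, here is a minimal sketch that uses only the two rates quoted above. The function name, the constant names, and the sample token counts are illustrative assumptions; real token counts must come from the usage data of your own requests.

```python
# Rates copied from this article (rough estimates as of June 2026).
INPUT_USD_PER_M_TOKENS = 0.60
AUDIO_USD_PER_M_TOKENS = 12.00


def token_priced_cost(input_tokens: int, audio_tokens: int) -> float:
    """Cost in USD for one request billed per token."""
    return (
        input_tokens * INPUT_USD_PER_M_TOKENS
        + audio_tokens * AUDIO_USD_PER_M_TOKENS
    ) / 1_000_000


# Audio tokens per minute implied by the ~$0.015/min rule of thumb.
# Derived from the figures above; this is NOT an official number.
IMPLIED_AUDIO_TOKENS_PER_MIN = 0.015 / (AUDIO_USD_PER_M_TOKENS / 1_000_000)

if __name__ == "__main__":
    # Hypothetical measured request: 400 input tokens, 2,500 audio tokens.
    print(f"{token_priced_cost(400, 2_500):.5f} USD")
    print(f"{IMPLIED_AUDIO_TOKENS_PER_MIN:.0f} audio tokens/min (implied)")
```

The audio-output term dominates, so the number that matters is how many audio tokens your scripts actually produce, which is why measuring beats estimating from character counts.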

Order-of-magnitude feel (rough estimate per 1M characters):

- **Qwen3-TTS-Flash**: among the cheapest (≈ $13). New-user free tier available.
- **OpenAI gpt-4o-mini-tts**: ≈ $15 equivalent in token terms.
- **Azure Neural HD**: $22 (cut from $30 in March 2026).
- **Google Chirp 3 HD**: $30, with a 1M-chars/month free tier.
- **ElevenLabs Flash/Turbo**: 0.5–1 credit per character via API (≈ $50/1M chars equivalent). A price band weighted toward quality and voice expressiveness.
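
The list above can be turned into a rough monthly comparison. The sketch below hard-codes the article's estimates; the dictionary layout and function names are assumptions for illustration, and the unquantified Qwen new-user free tier is left out.

```python
# USD per 1M characters and monthly free characters, as quoted in this
# article (rough estimates, June 2026). The OpenAI row is a character
# equivalent of token pricing, so treat it as the least reliable entry.
RATES = {
    "Qwen3-TTS-Flash": (13.0, 0),
    "OpenAI gpt-4o-mini-tts": (15.0, 0),
    "Azure Neural HD": (22.0, 0),
    "Google Chirp 3 HD": (30.0, 1_000_000),
    "ElevenLabs Flash/Turbo": (50.0, 0),
}


def monthly_cost(chars: int, usd_per_m: float, free_chars: int = 0) -> float:
    """Monthly cost in USD after subtracting any free allowance."""
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_m


def compare(chars_per_month: int) -> list[tuple[str, float]]:
    """Return (vendor, monthly USD) pairs sorted from cheapest to priciest."""
    rows = [
        (name, monthly_cost(chars_per_month, rate, free))
        for name, (rate, free) in RATES.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for name, usd in compare(5_000_000):
        print(f"{name:<24} {usd:>8.2f} USD/month")
```

With these figures, Qwen is cheapest at 5M characters a month, while at 500K characters a month Google's free tier puts it first: a small example of how the ranking flips with volume.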

**The decision pattern**: at small-to-mid volume, any of them is cheap enough as a "zero-ops API." At **high, always-on** volume, a break-even appears where self-hosting (Qwen OSS) fixed cost pays off (→ [TCO design for inference cost](/blog/llama-inference-cost-optimization-self-host-vs-api)). **"Verify with an API first → revisit once volume is predictable"** is the standard playbook.
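
The break-even mentioned above is a single division. This sketch assumes a flat monthly self-hosting cost (GPU plus operations), which is a hypothetical input you have to estimate yourself; it also ignores engineering time and the throughput ceiling of a single GPU.

```python
def break_even_chars_per_month(
    fixed_usd_per_month: float, api_usd_per_m_chars: float
) -> float:
    """Monthly character volume at which API spend equals the fixed cost."""
    if api_usd_per_m_chars <= 0:
        raise ValueError("API price must be positive")
    return fixed_usd_per_month / api_usd_per_m_chars * 1_000_000


def cheaper_option(
    chars_per_month: int, fixed_usd_per_month: float, api_usd_per_m_chars: float
) -> str:
    """Return 'self-host' or 'api', whichever costs less at this volume."""
    api_cost = chars_per_month / 1_000_000 * api_usd_per_m_chars
    return "self-host" if fixed_usd_per_month < api_cost else "api"


if __name__ == "__main__":
    # Hypothetical: $1,300/month fixed cost vs the ~$13/1M chars API estimate.
    volume = break_even_chars_per_month(1_300.0, 13.0)
    print(f"break-even at about {volume / 1_000_000:.0f}M chars/month")
```

Under those assumed numbers the break-even sits around 100M characters a month, which shows why small-to-mid workloads rarely justify self-hosting on cost alone.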

---

## 3. The winning angle by axis

### 3.1 Self-hosting / data sovereignty → Qwen3-TTS (the only open-weight one)

Among the major TTS options, **the only one that publishes weights under Apache-2.0 and lets you run on your own GPUs is Qwen3-TTS**. For projects where scripts or voices can't be sent to an external operator (healthcare, government, confidential, talent contracts), this is the deciding factor. ElevenLabs, OpenAI, Google, and Azure are all hosted APIs, and **the text is sent to the operator**.

### 3.2 Chinese / dialects → Qwen3-TTS

Qwen supports **nine Chinese dialects** (Beijing, Shanghai, Sichuan, Cantonese, Tianjin, Nanjing, Shaanxi, Min Nan, Mandarin). For local ads, entertainment, and characters aimed at Greater China, this is a differentiator. The others can speak Chinese too, but Qwen is a clear step ahead on dialect handling.

### 3.3 Ultra-multilingual coverage → Google / Azure

In pure supported-language count, **Azure (100+) and Google (75+)** are the broadest. If you need wide coverage that reaches minor languages, even at modest depth, these two are the candidates. Compared with Qwen's 10 languages, ElevenLabs' 32, and OpenAI's 50+, they have the edge on long-tail languages.

### 3.4 Ease of voice cloning → ElevenLabs / Azure or OSS if consent matters

- **Ease**: ElevenLabs clones instantly from short audio, plus high-quality pro clones. Strong for cloning-centric production.
- **Consent-gate emphasis**: **Azure Custom Neural Voice requires approval from Microsoft (limited access)**, so anti-impersonation safeguards are built into the service itself. **With OSS self-hosting you can design your own consent ledger, disclosure, and provenance** (→ [voice-cloning governance design](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)). For enterprise trust requirements, this "consent mechanism" becomes a reason for selection.
- **OpenAI**: no generally available voice cloning (off-the-shelf voices + instructions for direction). If you want to avoid impersonation risk, it's on the safer side.
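
For the self-hosted route, a consent gate can start as a small check that runs before any cloning job. The sketch below is one possible shape, not a standard: the `ConsentRecord` fields and the `may_clone` function are hypothetical names, and a real ledger would also need disclosure and provenance records.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConsentRecord:
    """One speaker's recorded consent (hypothetical schema)."""
    speaker_id: str
    granted_on: date
    allowed_uses: FrozenSet[str]
    expires_on: Optional[date] = None
    revoked: bool = False


def may_clone(record: Optional[ConsentRecord], use: str, today: date) -> bool:
    """Allow cloning only with valid, unexpired consent covering this use."""
    if record is None or record.revoked:
        return False
    if today < record.granted_on:
        return False
    if record.expires_on is not None and today > record.expires_on:
        return False
    return use in record.allowed_uses


if __name__ == "__main__":
    consent = ConsentRecord(
        speaker_id="speaker-001",
        granted_on=date(2026, 1, 10),
        allowed_uses=frozenset({"dubbing"}),
        expires_on=date(2026, 12, 31),
    )
    print(may_clone(consent, "dubbing", date(2026, 6, 25)))
    print(may_clone(consent, "advertising", date(2026, 6, 25)))
```

The default is denial: a missing record, a revoked record, or a use outside the granted scope all block the job.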

### 3.5 Conversational low latency → ElevenLabs Flash / Qwen realtime

For speed of first audio, **ElevenLabs Flash v2.5 (~75ms)** and **Qwen OSS (97ms stated) / Qwen realtime** are strong. OpenAI's first audio is around 300–600ms, a notch slower for conversation. Building real-time voice agents is detailed in the [Qwen-TTS realtime implementation guide](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide).
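
Vendor-stated latencies are measured under different conditions, so it helps to time first audio yourself. The sketch below is vendor-neutral: `measure_first_audio` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for whatever streaming call your chosen SDK provides.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_first_audio(
    start_stream: Callable[[], Iterable[bytes]]
) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    started = time.perf_counter()  # start the clock before the request begins
    first_audio_at = None
    total_bytes = 0
    for chunk in start_stream():
        if chunk and first_audio_at is None:
            first_audio_at = time.perf_counter()
        total_bytes += len(chunk)
    if first_audio_at is None:
        raise RuntimeError("stream ended without any audio")
    return first_audio_at - started, total_bytes


def fake_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Stand-in for a vendor stream: waits, then yields dummy audio bytes."""
    time.sleep(delay_s)
    for _ in range(chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    ttfa, size = measure_first_audio(fake_stream)
    print(f"first audio after {ttfa * 1000:.0f} ms, {size} bytes total")
```

Run it several times from the region where your service will live and compare medians rather than single runs, since network distance often matters as much as the model.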

### 3.6 Existing-ecosystem integration → OpenAI / each cloud

If you already build your stack on OpenAI (Whisper, GPT), `gpt-4o-mini-tts` fits in with the same SDK, the same billing, and the same `instructions`-style control. If you're AWS-centric, Polly is the natural fit; if you're GCP- or Azure-centric, that vendor's TTS plugs easily into your existing IAM, billing, and monitoring. **"Low friction with the stack you already have" is a legitimate selection axis too.**

---

## 4. Decision framework (working backward from requirements)

Apply the questions top to bottom to narrow candidates.

1. **Can't send scripts/voices externally?** → **If YES, Qwen3-TTS OSS (self-host)**. If NO, next.
2. **Are Chinese/dialects the star?** → **If YES, Qwen3-TTS**. If NO, next.
3. **Cloning a real person's voice?** → **YES**: Azure Custom Neural Voice or OSS if consent systems matter; ElevenLabs if ease matters. **NO**: continue with off-the-shelf voices.
4. **Is conversation (low latency) a requirement?** → **YES**: ElevenLabs Flash / Qwen realtime. **NO (batch)**: next.
5. **Ultra-multilingual (minor languages)?** → **YES**: Google / Azure. **NO**: next.
6. **Cost-first & stack-neutral?** → **Qwen3-TTS-Flash** (among the cheapest, new-user free tier). OpenAI-centric → `gpt-4o-mini-tts`.

> This flow is only **a first-pass narrowing**. Make the final decision only after **running each candidate on your own scripts (a PoC)** and measuring quality, latency, and cost. **The iron rule is to "choose TTS by ear"** (don't decide on numbers alone).
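
The six questions can also be written down as a first-match function, which is handy for recording why a project landed on a given shortlist. This is a sketch of the flow in this section only; the `Requirements` field names and the `shortlist` function are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Answers to the six questions (hypothetical field names)."""
    must_stay_in_house: bool = False
    chinese_dialects_central: bool = False
    clones_real_voice: bool = False
    consent_system_matters: bool = False
    conversational: bool = False
    needs_minor_languages: bool = False
    openai_centric: bool = False


def shortlist(req: Requirements) -> List[str]:
    """Apply the questions top to bottom and return the first match."""
    if req.must_stay_in_house:
        return ["Qwen3-TTS OSS (self-host)"]
    if req.chinese_dialects_central:
        return ["Qwen3-TTS"]
    if req.clones_real_voice:
        if req.consent_system_matters:
            return ["Azure Custom Neural Voice", "OSS self-hosting (Qwen3-TTS)"]
        return ["ElevenLabs"]
    if req.conversational:
        return ["ElevenLabs Flash", "Qwen realtime"]
    if req.needs_minor_languages:
        return ["Google", "Azure"]
    if req.openai_centric:
        return ["gpt-4o-mini-tts"]
    return ["Qwen3-TTS-Flash"]


if __name__ == "__main__":
    print(shortlist(Requirements(conversational=True)))
    print(shortlist(Requirements(clones_real_voice=True, consent_system_matters=True)))
```

The output is a list of candidates to take into a PoC, not a verdict.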

---

## 5. Common misconceptions (pitfalls)

- **"More supported languages" does not mean "high quality in your language"**: coverage and native quality are separate. Verify the languages you actually need with a PoC.
- **"Cheapest" does not mean "optimal"**: if voice expressiveness, latency, or operations are requirements, a higher unit price can win overall.
- **Estimating per-token pricing as if it were per-character gives the wrong number**: always measure OpenAI in tokens.
- **"Can clone" and "may clone" are different**: a clone without consent, disclosure, and provenance design is asking for trouble ([governance design](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)).
- **Price tables go stale**: including the numbers in this article, **always recompute against the latest official pricing before ordering**.

---

## 6. Summary: a selection cheat sheet

- **Can't send data out / no limits / fixed cost**: **Qwen3-TTS OSS** (the only open-weight one).
- **Chinese / dialects are the star**: **Qwen3-TTS**.
- **Ultra-multilingual (long tail)**: **Google Chirp 3 HD / Azure**.
- **Easy voice cloning**: **ElevenLabs**. **Consent systems matter**: **Azure Custom Neural Voice / OSS**.
- **Conversational low latency**: **ElevenLabs Flash / Qwen realtime**.
- **OpenAI-stack integration / cost**: **gpt-4o-mini-tts**.
- **Cost-first and neutral**: **Qwen3-TTS-Flash** (among the cheapest, new-user free tier).

TTS selection is the work of discerning "fit with requirements," not "product superiority." On a multilingual dubbing project, I weighed quality, latency, cost, and data constraints to select a TTS and built it into the production pipeline ([the AI video-localization platform](/case-studies/ai-video-localization-lipsync)). **"Choose the TTS best suited to your requirements and put it into production" — I work alongside you from vendor-neutral selection through implementation and operations.** Feel free to reach out from the requirements-gathering stage.

---

### References (official documentation)

- [Qwen-TTS (Alibaba Cloud Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts) / [QwenLM/Qwen3-TTS (GitHub)](https://github.com/QwenLM/Qwen3-TTS)
- [ElevenLabs Models](https://elevenlabs.io/docs/overview/models) / [Pricing](https://elevenlabs.io/pricing/api)
- [OpenAI gpt-4o-mini-tts](https://platform.openai.com/docs/models/gpt-4o-mini-tts) / [Text to Speech guide](https://developers.openai.com/api/docs/guides/text-to-speech)
- [Google Cloud Text-to-Speech: Chirp 3 HD](https://docs.cloud.google.com/text-to-speech/docs/chirp3-hd) / [Pricing](https://cloud.google.com/text-to-speech/pricing)
- [Azure AI Speech pricing](https://azure.microsoft.com/en-us/pricing/details/speech/)