# Voice AI production implementation guide [2026]: the big picture and technology selection for speech recognition (STT) × speech synthesis (TTS) × voice agents

> A big-picture guide to putting voice AI (speech recognition/STT, speech synthesis/TTS, and voice agents) into production. It serves as a map to the deep-dive articles and explains, from a practitioner's view, technology selection for each layer, ear (Whisper) → brain (LLM) → mouth (Qwen-TTS), along with low-latency design for real-time dialogue, preprocessing (source separation, VAD), idempotent caching, resilience, observability, cost, a11y, and the ethics of voice cloning.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: Speech synthesis, Speech recognition, Generative AI, Qwen, Architecture design, Technology selection
- URL: https://tomodahinata.com/en/blog/voice-ai-production-guide-stt-tts-voice-agents
- Category: Voice AI

## Key points

- Voice AI is a chain of 'ear (STT) → brain (LLM) → mouth (TTS).' What makes or breaks production is the design of boundaries, resilience, cost, observability, and ethics, more than how smart the model is.
- Each layer has a representative technology: Whisper for STT, Qwen-TTS for TTS, source separation/VAD for preprocessing, and real-time synthesis for dialogue.
- The perceived quality of dialogue is decided by 'time to first audio.' Design batch generation and real-time dialogue as separate systems.
- The same production equipment applies across all layers: content-hash idempotent caching, exponential backoff, observability, and logs that don't emit PII.
- Build consent, disclosure, and provenance into the voice-cloning design as a premise, not an afterthought. This is the trust boundary for enterprise projects.

---

"I want to handle voice with AI": transcription, read-aloud, automated phone response, multilingual dubbing, a conversing receptionist. The requirements vary, but when it comes to production, **the decisions you must make are remarkably similar.** **Which model to choose. Real-time or batch. How to keep cost down. Whether a failed job can be rerun safely. And how to hold the line against cloning someone's voice without permission.**

This article is the **big picture (map)** for assembling voice AI at **production quality.** It leaves the deep dive on each layer to individual articles and lays out the design blueprint here: **which technology to use, where, and how to combine them,** plus the production equipment common to all layers. The material comes from two systems I actually built: the [generative-AI voice chatbot](/case-studies/ai-voice-chatbot) (an STT→LLM→TTS voice concierge) and the [AI video-localization platform](/case-studies/ai-video-localization-lipsync) (multilingual dubbing).

> **Rules for this article**: each model's specs and pricing are based on **its official documentation (as of June 2026).** Pricing and model names change quickly, so always confirm against primary sources before production. The code is shaped to be usable in real operation, and API keys are assumed to come from environment variables (never hardcode them, and never expose them to the browser).

---

## 0. The map of voice AI: five layers

A voice-AI system can almost always be decomposed into a combination of the following layers. First isolate **which layers your requirements need.**

```text
[① Preprocessing]         [② STT]        [③ LLM]                [④ TTS]        [⑤ Avatar]
source separation / VAD → audio → text → understand / respond → text → audio → lip sync (optional)
(noise / BGM removal)     (ear)          (brain)                (mouth)        (face)
```

Each layer has its own go-to deep-dive article. This guide is their hub.

| Layer | Role | Representative tech | Deep-dive article |
| --- | --- | --- | --- |
| ① Preprocessing | Noise/BGM removal, silence detection | Demucs / UVR5 / VAD | [Source-separation tool selection](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter) |
| ② STT | Audio → text | Whisper / gpt-4o-transcribe | [Whisper production-operations guide](/blog/openai-whisper-production-guide-selfhost-vs-api) |
| ③ LLM | Understanding, response generation | Claude / GPT (+RAG) | [Vercel AI SDK production guide](/blog/vercel-ai-sdk-production-llm-apps-streaming-tools-rag) |
| ④ TTS | Text → audio | Qwen-TTS / ElevenLabs, etc. | [Qwen-TTS production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide) |
| ⑤ Avatar | Audio → lip sync | MuseTalk / LatentSync | [Lip sync / digital human](/blog/ai-lip-sync-talking-head-model-selection-guide-2026) |

> **The starting point of design**: only a "conversing digital human" uses every layer. **Transcription needs only ②, and read-aloud needs only ④.** Don't add layers your requirements don't call for. This is the first cost optimization (YAGNI).

---

## 1. STT (speech recognition): turn audio into text

The "ear" layer. Used for transcribing recordings, minutes, subtitles, search indexes, and the input of a voice agent.

```python
# OpenAI Audio API (zero ops, high accuracy). For self-hosting, use Whisper turbo.
from openai import OpenAI

client = OpenAI()  # the API key is read from the environment

with open("speech.mp3", "rb") as f:
    text = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        language="ja",  # pinning the language improves accuracy
    ).text
```

The selection comes down to four axes: **whether the audio may leave your environment (privacy), audio length, cost structure, and ops setup.** If confidential audio can't be sent to an external API, self-hosting (Whisper `turbo`/`large`) is the only choice. For details, working around the 25MB limit, and hallucination countermeasures, see the [Whisper production-operations guide](/blog/openai-whisper-production-guide-selfhost-vs-api).

---

## 2. TTS (speech synthesis): turn text into audio

The "mouth" layer. Used for narration, read-aloud, IVR, dubbing, and character voices. This guide's mainstay is [Qwen-TTS](/blog/qwen-tts-qwen3-tts-flash-production-guide) (49+ timbres, 10 languages, 9 Chinese dialects, plus an Apache-2.0 OSS version).

```python
import os

import dashscope

# International endpoint for the DashScope API.
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

res = dashscope.MultiModalConversation.call(
    model="qwen3-tts-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # never hardcode the key
    text="本日はご来店ありがとうございます。",  # sample Japanese input text
    voice="Cherry",
    language_type="Japanese",
    stream=False,
)
# The URL expires after 24 hours, so copy the audio to your own storage immediately.
audio_url = res.output.audio.url
```
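
Because the returned URL expires after 24 hours, the copy to your own storage is worth doing in the same code path that receives it. The following is a minimal sketch, not the author's implementation: `evacuate_audio`, its `fetch` parameter, and the `.wav` default suffix are assumptions made for illustration, and local disk stands in for whatever object storage you actually use.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _http_get(url: str) -> bytes:
    """Default fetcher: a plain HTTP GET with a timeout."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def evacuate_audio(
    url: str,
    dest_dir: Path,
    fetch: Callable[[str], bytes] = _http_get,  # injectable for tests (assumption)
    suffix: str = ".wav",  # assumed container format; adjust to your output
) -> Path:
    """Download a temporary audio URL and persist the bytes under dest_dir.

    The file is named by the SHA-256 of its content, so saving the same
    audio twice lands on the same path instead of creating duplicates.
    """
    data = fetch(url)
    if not data:
        raise ValueError("empty audio response")
    dest_dir.mkdir(parents=True, exist_ok=True)
    path = dest_dir / f"{hashlib.sha256(data).hexdigest()}{suffix}"
    path.write_bytes(data)
    return path
```

Calling `evacuate_audio(audio_url, Path("audio-store"))` right after synthesis leaves a durable file, and your database then stores that path rather than the expiring URL.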

TTS branches into many sub-topics, so pick the deep dive that matches your purpose:

- **Voice selection, dialects, instruction control, the OSS-version overview** → [Qwen-TTS production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide)
- **Comparison/selection vs others (ElevenLabs/OpenAI/Google/Azure)** → [TTS in-depth comparison](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison)
- **A unique voice (voice cloning) and ethics/consent design** → [Voice-cloning production implementation](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)
- **An accessible "article read-aloud" UI** → [Next.js read-aloud player](/blog/nextjs-qwen-tts-accessible-audio-player-text-to-speech)

---

## 3. Voice agent: connect ear and mouth (real-time dialogue)

Connect STT→LLM→TTS into a **single low-latency loop** and you have a voice agent (automated phone response, reception, concierge). What decides quality here isn't smarts but **timing**: the **time until the first words come back**, and **barge-in**, where the agent goes silent the instant the user speaks.

```text
User speech → ② STT → ③ LLM (streaming) → ④ TTS (real-time) → playback
      ↑_________ barge-in (stop immediately on interruption) ________|
```

The key is to "not wait for the full text." **Stream to TTS the moment the LLM's first sentence is complete** to shrink `first_audio_delay`. The WebSocket protocol, gapless PCM playback in the browser, and the barge-in implementation are detailed in the [Qwen-TTS real-time voice-agent implementation guide](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide). A real-world example of a RAG concierge is the [generative-AI voice chatbot](/case-studies/ai-voice-chatbot).
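
The "first sentence, then stream" idea can be sketched as a small generator that sits between the LLM's token stream and the TTS call. This is an illustrative sketch under stated assumptions: `sentences_from_stream` and the `SENTENCE_END` set are hypothetical names, and the naive punctuation rule will also split on decimals such as "3.14", so a real system needs a smarter segmenter.

```python
from typing import Iterable, Iterator

# Sentence terminators for Japanese and English (an assumed, minimal set).
SENTENCE_END = "。！？.!?"


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its terminator arrives in the stream."""
    buf = ""
    for chunk in chunks:
        for ch in chunk:
            buf += ch
            if ch in SENTENCE_END:
                text = buf.strip()
                buf = ""
                # Skip fragments that are punctuation only (e.g. from "...").
                if any(c.isalnum() for c in text):
                    yield text
    tail = buf.strip()
    if tail:
        yield tail  # flush whatever remains when the LLM stream ends


if __name__ == "__main__":
    llm_chunks = ["Hello the", "re. How are", " you? I am", " fine"]
    for sentence in sentences_from_stream(llm_chunks):
        print(sentence)  # in a real agent, hand each sentence to TTS here
```

Each yielded sentence can be sent to TTS while the LLM is still generating the next one, which is what keeps the time to first audio short.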

> **Batch and real-time are different things**: batch work (transcribing recordings, generating narration in bulk) and the incremental processing of dialogue (real-time) call for different designs. Misjudge the requirement and you either open a WebSocket you don't need or build a UX that keeps users waiting (KISS).
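
Since `first_audio_delay` is the metric that defines dialogue quality, it helps to measure it where the audio chunks pass through. The wrapper below is a hedged sketch: `timed_audio`, the `on_done` callback, and the log field names are assumptions for illustration. It records metadata only, never the audio or the script.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def timed_audio(
    chunks: Iterable[bytes],
    on_done: Callable[[dict], None],
    started: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[bytes]:
    """Pass audio chunks through unchanged while timing the stream.

    `started` should be the clock value at the moment the user stopped
    speaking. If omitted, timing begins when the consumer first iterates.
    """
    t0 = clock() if started is None else started
    first_delay: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_delay is None:
            first_delay = clock() - t0  # time to first audio
        total_bytes += len(chunk)
        yield chunk
    # Reached only when the stream is fully consumed. A barge-in that stops
    # iteration early skips this record, so log that case separately.
    on_done(
        {
            "event": "tts_stream",
            "first_audio_delay_s": first_delay,
            "total_s": clock() - t0,
            "audio_bytes": total_bytes,
        }
    )
```

Passing a function that serializes the dict with `json.dumps` as `on_done` gives you one structured log line per utterance.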

---

## 4. Preprocessing: accuracy and cost are decided by "input quality"

Noise, BGM, and silence lower STT accuracy and degrade the perceived quality of TTS and dialogue. **Preprocessing that tidies the input** pays off.

- **Source separation**: extracting "voice only" from BGM and noise raises transcription accuracy. It is also essential when dubbing into multiple languages while keeping the BGM. → [Source-separation tool selection](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)
- **VAD (voice activity detection)**: dropping non-speech segments reduces the billed audio time and also suppresses Whisper's "hallucinations."
- **Normalization/splitting**: split long audio at silence boundaries (don't cut mid-sentence).

Most "the accuracy isn't good enough" complaints trace back to **input quality** rather than the model. Preprocessing is unglamorous, but its ROI is high.
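
To make the VAD and "split at silence boundaries" ideas concrete, here is a deliberately simple energy-threshold sketch. It is a stand-in for illustration only: production systems typically use a trained VAD model, and `speech_segments`, its threshold, and its frame sizes are assumed values, not recommendations.

```python
import math
from typing import List, Sequence, Tuple


def speech_segments(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,
    threshold: float = 0.02,
    min_silence_ms: int = 300,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame RMS reaches threshold.

    Spans separated by less than min_silence_ms of silence are merged, so a
    short pause inside a sentence does not split it.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    max_gap = min_silence_ms / 1000
    segments: List[Tuple[float, float]] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start : start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            continue  # silent frame: contributes nothing
        t0 = start / sample_rate
        t1 = (start + len(frame)) / sample_rate
        if segments and t0 - segments[-1][1] < max_gap:
            segments[-1] = (segments[-1][0], t1)  # extend the previous span
        else:
            segments.append((t0, t1))
    return segments
```

The returned spans serve both purposes named above: send only those spans to STT to cut billed time, and cut long audio at the gaps between spans so no chunk ends mid-sentence.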

---

## 5. The map of selection: a task-by-task cheat sheet

The table below answers "which should I use in the end?" so you can look it up by task.

| What you want to do | Layers mainly used | First candidate | Deep dive |
| --- | --- | --- | --- |
| Transcription/subtitles of recordings | ② | Whisper `turbo` / gpt-4o-transcribe | [Whisper](/blog/openai-whisper-production-guide-selfhost-vs-api) |
| Read-aloud of articles/materials | ④ | Qwen-TTS (+a11y player) | [Read-aloud UI](/blog/nextjs-qwen-tts-accessible-audio-player-text-to-speech) |
| Automated phone/reception dialogue | ②③④ | STT + LLM + Qwen-TTS realtime | [Real-time](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide) |
| Multilingual narration/dubbing | (①)④ | Qwen-TTS (10 languages) | [Qwen-TTS](/blog/qwen-tts-qwen3-tts-flash-production-guide) |
| Read in your/a talent's voice | ④ | OSS voice cloning (+consent design) | [Voice cloning](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide) |
| TTS vendor selection | ④ | Work backward from requirements | [TTS comparison](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison) |
| Accuracy improvement under noise/BGM | ①② | Source separation + VAD + Whisper | [Source separation](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter) |
| Talking avatar/receptionist | ②③④⑤ | The above + lip sync | [Digital human](/blog/ai-lip-sync-talking-head-model-selection-guide-2026) |

---

## 6. Common production equipment (the same across all layers)

Even across different layers, voice AI shares the same nature: "**an external API or GPU, metered billing, and long-running jobs.**" So the production equipment can be shared too (DRY).

- **Content-hash idempotent caching**: save the result keyed by the SHA-256 of the input (text/audio + parameters). The same input is never regenerated, which gives you **cost reduction + resumability + no double billing.**
- **Retry with exponential backoff**: retry only transient failures (429/5xx/disconnects) and fail fast on invalid input (other 4xx). Apply retries only to idempotent operations.
- **Lifetime management of artifacts**: external temporary resources such as TTS generation URLs **expire in 24 hours**, so copy them to your own storage immediately on receipt.
- **Observability**: emit structured logs for first_audio_delay (dialogue), RTF (the ratio of processing time to audio length), per-character/per-minute billing, estimated cost, and failure type. Correlate them with [OpenTelemetry](/blog/opentelemetry-observability-production-tracing-metrics-logs).
- **Don't emit PII**: audio and scripts can be personal data. Record **only metadata, not the body.**
- **Boundary validation**: validate the language, timbre, and file format/size [with Zod at the boundary](/blog/typescript-type-safety-discipline-zod-nevererror-no-any). Don't pass user-derived values straight through.
- **a11y**: don't autoplay generated audio, make it keyboard-operable, announce state with `aria-live`, and provide the text alongside ([WCAG 2.2](/blog/react-nextjs-web-accessibility-wcag22-guide)).
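
The content-hash cache from the first item can be sketched in a few lines. This is a minimal illustration under assumptions: `cache_key`, `cached_call`, and the `generate` callback are hypothetical names, and a local directory stands in for your real storage.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(payload: bytes, params: dict) -> str:
    """SHA-256 over the input bytes plus canonically serialized parameters."""
    canonical = json.dumps(params, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    h = hashlib.sha256()
    h.update(payload)
    h.update(b"\x00")  # separator between payload and parameters
    h.update(canonical.encode("utf-8"))
    return h.hexdigest()


def cached_call(
    cache_dir: Path,
    payload: bytes,
    params: dict,
    generate: Callable[[bytes, dict], bytes],  # the billed STT/TTS call
) -> bytes:
    """Return the stored result for this input; call generate only on a miss."""
    path = cache_dir / cache_key(payload, params)
    if path.exists():
        return path.read_bytes()
    result = generate(payload, params)
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(result)
    tmp.replace(path)  # rename into place so readers never see a partial file
    return result
```

Sorting the parameter keys before hashing means `{"voice": ..., "language_type": ...}` and the same dict in another order share one cache entry, while any change to text, voice, or language produces a new key.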

Build this common layer once first and call it from each feature — this is the core of maintainability and extensibility (SRP, ETC).
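
As one piece of that shared layer, the retry policy described above (retry 429/5xx/disconnects, fail fast on other 4xx) might look like the sketch below. `ApiError` is a hypothetical error type introduced here for illustration; in practice you would map your SDK's exceptions onto it. The delay values and the jitter are assumptions as well.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status (None means a disconnect)."""

    def __init__(self, status: Optional[int], message: str = "") -> None:
        super().__init__(message or f"status={status}")
        self.status = status


def is_transient(err: ApiError) -> bool:
    """429, 5xx, and disconnects are retryable; other 4xx are not."""
    return err.status is None or err.status == 429 or 500 <= err.status < 600


def with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run an idempotent operation, retrying transient failures only."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except ApiError as err:
            if not is_transient(err) or attempt == max_attempts - 1:
                raise  # invalid input or retries exhausted: fail fast
            cap = min(max_delay, base_delay * 2**attempt)  # 0.5, 1, 2, 4, ...
            sleep(random.uniform(0, cap))  # random jitter spreads out retries
    raise AssertionError("unreachable")
```

Wrapping the cached generation call in `with_backoff` is safe because the content-hash key makes that call idempotent, which is the condition the retry rule depends on.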

---

## 7. Ethics, compliance, and data residency

Voice AI is the area where "technically possible" and "permissible" diverge most. Build the following in as a **design premise.**

- **Voice-cloning consent, disclosure, and provenance**: cloning a voice without the person's consent leads directly to impersonation and fraud. Make **a consent ledger, use limitation, expiry, disclosure of AI generation, and provenance** mandatory. → [Voice-cloning governance design](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)
- **Disclose AI generation**: state clearly that the voice is synthetic (regulatory compliance and ethics).
- **Data residency**: which country and operator do scripts and voices go to? Keep confidential or sensitive personal data inside the boundary through region selection or OSS self-hosting.
- **Retention policy**: decide the retention period, encryption, and deletion flow up front.
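
The consent ledger, use limitation, and expiry can be enforced as a gate in front of every cloned-voice synthesis call. The sketch below shows one possible shape, not a compliance-reviewed design: `VoiceConsent`, its fields (including `revoked`), and `may_synthesize` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class VoiceConsent:
    """One ledger entry: who consented, for which uses, and until when."""

    speaker_id: str
    allowed_uses: FrozenSet[str]  # use limitation, e.g. {"narration"}
    expires_at: datetime  # timezone-aware expiry
    revoked: bool = False  # assumed field: the speaker withdrew consent


def may_synthesize(
    consent: Optional[VoiceConsent],
    use: str,
    now: Optional[datetime] = None,
) -> bool:
    """Allow synthesis only with a live, unrevoked consent covering this use."""
    if consent is None or consent.revoked:
        return False  # no ledger entry means no cloning
    current = now or datetime.now(timezone.utc)
    return use in consent.allowed_uses and current < consent.expires_at
```

Making the check default to "deny" when no record exists keeps a missing ledger entry from silently turning into permission.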

---

## 8. Use-case-by-use-case recipes

- **Multilingual e-learning/narration**: loop over `language_type` with a fixed timbre to render the script in 10 languages (④). → [Qwen-TTS](/blog/qwen-tts-qwen3-tts-flash-production-guide)
- **Multilingual video dubbing**: ① source separation → ② transcription → translation → ④ TTS → ⑤ lip sync. This is the flow I built in the [AI video-localization platform](/case-studies/ai-video-localization-lipsync).
- **Automated phone IVR/reception dialogue**: ②③④ in real time over WebSocket (→ [Real-time](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide), [voice chatbot](/case-studies/ai-voice-chatbot)).
- **Accessible read-aloud**: convert articles/documents to audio with ④ and serve them through an a11y player (→ [Read-aloud UI](/blog/nextjs-qwen-tts-accessible-audio-player-text-to-speech)).
- **Broadcasting/production QA**: detect typos in on-screen captions (telop) by cross-checking OCR against ASR (an application of ②) → [Telop typo detection](/blog/telop-typo-detection-ocr-asr-cloud-workflows).

---

## 9. Summary: a voice-AI design cheat sheet

- **First isolate the layers**: do you need only ②, only ④, ②③④ dialogue, or all of them?
- **STT**: audio can't leave your environment / long recordings → self-hosted Whisper. Same-day start / high accuracy → gpt-4o-transcribe.
- **TTS**: multilingual/dialect/cheapest → Qwen-TTS. Data sovereignty/unique voice → OSS self-hosting. For selection, see the [comparison article](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison).
- **Dialogue**: with first_audio_delay as the metric, stream to TTS sentence by sentence and stop instantly on barge-in.
- **Preprocessing**: raise input quality with source separation and VAD (this directly affects accuracy and cost).
- **Common equipment**: idempotent caching, exponential backoff, copying 24-hour URLs to your own storage, observability, no PII emission, a11y.
- **Ethics**: treat consent, disclosure, and provenance as a premise for cloning. Make data residency a design decision.

Voice AI looks like a "one-line requirement," but it's really **the work of designing trade-offs among layer selection, cost, latency, privacy, and ethics.** I've run both a voice concierge (STT→LLM→TTS) and multilingual dubbing in production, built for idempotency, resilience, observability, a11y, and ethics. **If you want to bring voice operations to production "fast, cheap, safe, and usable by anyone," I can support you end to end, from design through implementation and operations.** Feel free to reach out from the requirements-gathering stage.

---

### References (deep-dive articles / official documentation)

- STT: [Whisper production-operations guide](/blog/openai-whisper-production-guide-selfhost-vs-api)
- TTS: [Qwen-TTS production-operations guide](/blog/qwen-tts-qwen3-tts-flash-production-guide) / [TTS in-depth comparison](/blog/qwen-tts-vs-elevenlabs-openai-google-azure-tts-comparison)
- Dialogue: [Real-time voice agent](/blog/qwen-tts-realtime-voice-agent-websocket-streaming-guide)
- Ethics: [Voice-cloning governance design](/blog/qwen-tts-voice-cloning-self-hosting-consent-governance-guide)
- Preprocessing: [Source-separation tool selection](/blog/music-source-separation-tool-selection-demucs-uvr-spleeter)
- Official: [OpenAI Speech-to-Text](https://developers.openai.com/api/docs/guides/speech-to-text) / [Qwen-TTS (Alibaba Cloud Model Studio)](https://www.alibabacloud.com/help/en/model-studio/qwen-tts)