友田 陽大

Voice-AI production-implementation guide [2026]: the big picture and tech selection of speech recognition (STT) × speech synthesis (TTS) × voice agents

A big-picture guide to putting voice AI (speech recognition/STT, speech synthesis/TTS, and voice agents) into production. Serving as a map to the individual deep-dive articles, it covers, from a practitioner's perspective, technology selection for each layer (ear: Whisper → brain: LLM → mouth: Qwen-TTS), low-latency design for real-time dialogue, preprocessing (source separation, VAD), idempotent caching, resilience, observability, cost, accessibility (a11y), and the ethics of voice cloning.

Published
Reading time
9 min read
Author
友田 陽大

"I want to handle voice with AI": transcription, read-aloud, automated phone response, multilingual dubbing, a conversational receptionist. The requirements vary, but once you move to production, the decisions you face are remarkably similar. Which model to choose. Real-time or batch. How to keep costs down. Whether a failed job can be safely rerun. And how to hold the line against cloning someone's voice without permission.

This article is the big picture (the map) for building voice AI at production quality. It leaves the details of each layer to the individual deep-dive articles and focuses on the design blueprint (which technology goes where, and how the pieces combine) plus the production equipment common to all layers. The examples come from two systems I actually built: a generative-AI voice chatbot (an STT→LLM→TTS voice concierge) and an AI video-localization platform (multilingual dubbing).

Ground rules for this article: each model's specs and pricing are based on its official documentation (as of June 2026). Pricing and model names change quickly, so always confirm against primary sources before going to production. The code is written to be usable in real operation, and it assumes API keys come from environment variables (never hardcode them, and never expose them to the browser).


0. The map of voice AI: five layers

A voice-AI system can almost always be decomposed into a combination of the following layers. First isolate which layers your requirements need.

[① Preprocessing]         [② STT]          [③ LLM]                    [④ TTS]          [⑤ Avatar]
Source separation/VAD  →  audio → text  →  understanding/response  →  text → audio  →  lip sync (optional)
(noise/BGM removal)       (ear)            (brain)                    (mouth)          (face)

Each layer has its own go-to deep-dive article. This guide is the hub that links them.

| Layer | Role | Representative tech | Deep-dive article |
|---|---|---|---|
| ① Preprocessing | Noise/BGM removal, silence detection | Demucs / UVR5 / VAD | Source-separation tool selection |
| ② STT | Audio → text | Whisper / gpt-4o-transcribe | Whisper production-operations guide |
| ③ LLM | Understanding, response generation | Claude / GPT (+RAG) | Vercel AI SDK production guide |
| ④ TTS | Text → audio | Qwen-TTS / ElevenLabs, etc. | Qwen-TTS production-operations guide |
| ⑤ Avatar | Audio → lip sync | MuseTalk / LatentSync | Lip sync / digital human |

The starting point of design: only a "conversing digital human" uses every layer. Transcription needs only ②, and read-aloud needs only ④. Leaving out layers your requirements don't call for is the first cost optimization (YAGNI).


1. STT (speech recognition): turn audio into text

The "ear" layer. Used for transcribing recordings, minutes, subtitles, search indexes, and the input of a voice agent.

# OpenAI Audio API (zero ops, high accuracy). For self-hosting, use Whisper turbo.
from openai import OpenAI

client = OpenAI()  # the API key is read from the OPENAI_API_KEY environment variable

with open("speech.mp3", "rb") as f:
    text = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
        language="ja",  # pinning the language improves accuracy
    ).text

Selection comes down to four axes: whether the audio may leave your environment (privacy), audio length, cost structure, and your operations setup. If confidential audio cannot be sent to an external service, self-hosting (Whisper turbo/large) is the only option. For details, including working around the 25MB limit and countering hallucinations, see the Whisper production-operations guide.
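
As a rough illustration of the privacy and length axes, the rule of thumb can be written as a small function. This is only a sketch under stated assumptions: `SttRequirement`, `choose_stt_backend`, and the returned labels are hypothetical names, and the 25MB constant is taken from the limit mentioned above, so confirm the exact value in the official documentation.

```python
from dataclasses import dataclass

# Assumption: the 25MB upload limit mentioned in the text, read here as 25 * 1024 * 1024 bytes.
API_UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024


@dataclass(frozen=True)
class SttRequirement:
    """Hypothetical description of one transcription job."""
    may_leave_environment: bool  # can the audio be sent to an external API?
    file_size_bytes: int         # size of the audio file


def choose_stt_backend(req: SttRequirement) -> str:
    """Return a hypothetical backend label for the job."""
    if not req.may_leave_environment:
        return "self-hosted-whisper"  # confidential audio stays inside
    if req.file_size_bytes > API_UPLOAD_LIMIT_BYTES:
        return "self-hosted-whisper"  # long audio: avoid the upload limit
    return "gpt-4o-transcribe"        # short, shareable audio: hosted API
```

Cost structure and operations setup are deliberately left out; they depend on volume and team, and are better decided per project than hard-coded.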


2. TTS (speech synthesis): turn text into audio

The "mouth" layer. Narration, read-aloud, IVR, dubbing, character voices. This guide's mainstay is Qwen-TTS (49+ timbres, 10 languages, 9 Chinese dialects, with an Apache-2.0 OSS version).

import os

import dashscope

# International endpoint of the DashScope API
dashscope.base_http_api_url = "https://dashscope-intl.aliyuncs.com/api/v1"

res = dashscope.MultiModalConversation.call(
    model="qwen3-tts-flash",
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # read from the environment, never hardcoded
    text="本日はご来店ありがとうございます。",  # "Thank you for visiting us today."
    voice="Cherry",
    language_type="Japanese",
    stream=False,
)
audio_url = res.output.audio.url  # expires in 24 hours: copy it to your own storage right away
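
The 24-hour expiry of the returned URL can be handled with a small helper that downloads the audio and stores it locally. This is a minimal sketch, not the author's implementation: `persist_audio`, the injectable `fetch` parameter, and the `.wav` suffix are assumptions for illustration, and a real system would more likely write to object storage.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Callable


def _http_get(url: str) -> bytes:
    """Default fetcher: download the bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def persist_audio(
    url: str,
    dest_dir: Path,
    fetch: Callable[[str], bytes] = _http_get,
    suffix: str = ".wav",  # assumption: adjust to the actual audio format
) -> Path:
    """Download a temporary audio URL and store it under a content-derived name."""
    data = fetch(url)
    if not data:
        raise ValueError("downloaded audio is empty")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Naming by content hash makes repeated saves of the same audio idempotent.
    path = dest_dir / (hashlib.sha256(data).hexdigest()[:32] + suffix)
    path.write_bytes(data)
    return path
```

Calling `persist_audio(audio_url, Path("audio_store"))` right after synthesis keeps the artifact available after the vendor URL has expired.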

TTS branches into many related topics, so pick the deep-dive that matches your purpose; the task-by-task cheat sheet in section 5 lists them.


3. Voice agent: connect ear and mouth (real-time dialogue)

Turn STT→LLM→TTS into a single low-latency loop and you have a voice agent (automated phone response, reception, concierge). What determines quality here is not intelligence but timing: how quickly the first words come back, and barge-in, where the agent falls silent the instant the user starts speaking.

User utterance → ② STT → ③ LLM (streaming) → ④ TTS (real-time) → playback
        ↑____________ barge-in (stop immediately on interruption) ____________|
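
The barge-in arrow in the loop above can be sketched with `asyncio`: playback runs as a task and is cancelled as soon as a "user is speaking" event fires. This is a simplified illustration under assumptions: `speak` merely simulates playback with a sleep, and `run_turn` and the event wiring are hypothetical names, not the protocol from the deep-dive guide.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str]) -> None:
    """Stand-in for TTS playback: 'plays' one chunk every 50 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.05)
        played.append(chunk)


async def run_turn(chunks: List[str], barge_in: asyncio.Event) -> List[str]:
    """Play chunks until they finish or the user starts speaking."""
    played: List[str] = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(barge_in.wait())
    await asyncio.wait({playback, interrupt}, return_when=asyncio.FIRST_COMPLETED)
    # Whichever finished first, cancel the other; cancel() on a finished task is a no-op.
    playback.cancel()
    interrupt.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played
```

In a real agent the event would be set by the VAD on the microphone stream, and cancelling playback would also flush any audio already buffered on the client.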

The key is to not wait for the full text. Hand each sentence to TTS as soon as the LLM finishes it, which shrinks first_audio_delay. The WebSocket protocol, gapless PCM playback in the browser, and the barge-in implementation are covered in the Qwen-TTS real-time voice-agent implementation guide. For a real-world RAG concierge, see the generative-AI voice chatbot.
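
One way to avoid waiting for the full text is to cut the LLM's token stream into sentences and forward each one the moment its terminator arrives. The generator below is a deliberately naive sketch: `sentences_from_stream` is a hypothetical helper, and splitting on every period will also split decimals and abbreviations, so production code needs a more careful boundary rule.

```python
from typing import Iterable, Iterator

# Japanese and Latin sentence terminators (a naive, illustrative set).
SENTENCE_END = "。！？.!?"


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its terminator appears in the token stream."""
    buffer = ""
    for token in tokens:
        for ch in token:
            buffer += ch
            if ch in SENTENCE_END:
                sentence = buffer.strip()
                if sentence:
                    yield sentence
                buffer = ""
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends
```

Each yielded sentence can be sent to the TTS connection immediately, so the first audio starts while the LLM is still generating the rest of the answer.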

Batch and real-time are different problems: batch work (transcribing recordings, generating narration in bulk) and the turn-by-turn processing of live dialogue call for different designs. Misjudge the requirement and you either open a WebSocket you don't need or build a UX that keeps users waiting (KISS).


4. Preprocessing: accuracy and cost are decided by "input quality"

Noise, BGM, and silence lower STT accuracy and hurt the feel of TTS/dialogue. Preprocessing that tidies the input pays off.

  • Source separation: extracting "voice only" from BGM and noise raises transcription accuracy. Also essential when dubbing into multiple languages while keeping the BGM. → Source-separation tool selection
  • VAD (voice activity detection): dropping non-speech segments reduces billing and also suppresses Whisper's "hallucinations."
  • Normalization/splitting: split long audio at silence boundaries (don't cut mid-sentence).
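
The VAD and silence-boundary splitting ideas above can be illustrated with a simple energy threshold. This is a toy sketch, assuming mono samples normalized to [-1, 1]; the function names, frame length, and thresholds are illustrative choices, and a production system would normally use a trained VAD model instead.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each consecutive frame."""
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies


def speech_segments(
    samples: Sequence[float],
    frame_len: int = 160,
    threshold: float = 0.02,
    min_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges that contain speech.

    A segment is closed only after `min_silence_frames` quiet frames in a row,
    so a short pause inside a sentence does not cause a cut.
    """
    energies = frame_rms(samples, frame_len)
    segments: List[Tuple[int, int]] = []
    seg_start = None
    quiet_run = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if seg_start is None:
                seg_start = i
            quiet_run = 0
        elif seg_start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                end_frame = i - quiet_run + 1  # index of the first quiet frame
                segments.append((seg_start * frame_len, end_frame * frame_len))
                seg_start, quiet_run = None, 0
    if seg_start is not None:  # audio ended while still inside a segment
        end_frame = len(energies) - quiet_run
        segments.append((seg_start * frame_len, min(end_frame * frame_len, len(samples))))
    return segments
```

Sending only the returned ranges to STT drops the silent stretches, which is where both the billing savings and the reduction in hallucinated text come from.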

Most complaints that "the accuracy isn't good enough" trace back to input quality, not the model. Preprocessing is unglamorous, but it is a high-ROI area.


5. The map of selection: a task-by-task cheat sheet

The table below maps each task to the layers and tools to reach for first.

| What you want to do | Layers mainly used | First candidate | Deep dive |
|---|---|---|---|
| Transcription/subtitles of recordings | ② | Whisper turbo / gpt-4o-transcribe | Whisper |
| Read-aloud of articles/materials | ④ | Qwen-TTS (+a11y player) | Read-aloud UI |
| Automated phone/reception dialogue | ②③④ | STT + LLM + Qwen-TTS realtime | Real-time |
| Multilingual narration/dubbing | (①)④ | Qwen-TTS (10 languages) | Qwen-TTS |
| Read in your/a talent's voice | ④ | OSS voice cloning (+consent design) | Voice cloning |
| TTS vendor selection | ④ | Work backward from requirements | TTS comparison |
| Accuracy improvement under noise/BGM | ①② | Source separation + VAD + Whisper | Source separation |
| Talking avatar/receptionist | ②③④⑤ | The above + lip sync | Digital human |


6. Common production equipment (the same across all layers)

Whatever the layer, voice-AI workloads share the same traits: they call an external API or a GPU, they are billed by usage, and they run as long jobs. That means the production equipment can be shared as well (DRY).

  • Content-hash idempotent caching: save the result keyed by the SHA-256 of the input (text/audio + parameters). Don't regenerate for the same input = cost reduction + resumability + no double billing.
  • Retry with exponential backoff: retry only transient failures (429, 5xx, disconnects) and fail fast on invalid input (other 4xx). Apply retries only to idempotent operations.
  • Lifetime management of artifacts: external temporary resources like TTS generation URLs expire in 24 hours, so evacuate them to your own storage immediately on receipt.
  • Observability: structured-log first_audio_delay (dialogue), RTF (processing-time ratio to audio length), per-character/minute billing, estimated cost, and failure type. Correlate with OpenTelemetry.
  • Don't emit PII: audio and scripts can be personal data. Record only metadata, not the body.
  • Boundary validation: validate with Zod at the boundary the language, timbre, and file format/size. Don't pass user-derived values straight through.
  • a11y: don't autoplay generated audio, make it keyboard-operable, announce state with aria-live, and provide the text alongside (WCAG 2.2).
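
The first two items, content-hash caching and retry with exponential backoff, combine naturally into one wrapper. The sketch below is an illustration under assumptions: `TransientError`, `synthesize_cached`, and the injected `synthesize` callable are hypothetical names, and the cache is a local directory rather than shared storage.

```python
import hashlib
import json
import random
import time
from pathlib import Path
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (429, 5xx, disconnects)."""


def cache_key(text: str, params: dict) -> str:
    """SHA-256 over the input text plus canonically ordered parameters."""
    payload = json.dumps({"text": text, "params": params}, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def with_backoff(fn: Callable[[], bytes], attempts: int = 4, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Retry only TransientError, doubling the delay (plus jitter) each time."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            sleep(delay + random.uniform(0, delay / 10))
            delay *= 2
    raise ValueError("attempts must be >= 1")


def synthesize_cached(text: str, params: dict, cache_dir: Path,
                      synthesize: Callable[[str, dict], bytes],
                      sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Return cached audio for identical input, otherwise synthesize and store it."""
    path = cache_dir / f"{cache_key(text, params)}.bin"
    if path.exists():
        return path.read_bytes()  # cache hit: no API call, no billing
    audio = with_backoff(lambda: synthesize(text, params), sleep=sleep)
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(audio)
    tmp.replace(path)  # rename into place so a crash never leaves a half-written entry
    return audio
```

Because the key covers both the text and the parameters, rerunning a failed batch only pays for the items that never completed.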

Build this common layer once, up front, and call it from every feature. That is the core of maintainability and extensibility (SRP, ETC).
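
As one more piece of such a shared layer, the observability and no-PII items can be combined into a single log-line builder. This is a sketch under assumptions: the field names, `build_log_line`, and the caller-supplied `unit_price` are hypothetical, and the `trace_id` is simply passed in rather than taken from a tracing library.

```python
import json
import time
from typing import Optional


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio length (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def build_log_line(
    layer: str,
    model: str,
    processing_seconds: float,
    audio_seconds: float,
    billed_units: int,
    unit_price: float,
    trace_id: str,
    first_audio_delay: Optional[float] = None,
    failure_type: Optional[str] = None,
) -> str:
    """Return one JSON log line holding metadata only (no transcript, no audio)."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,  # lets the line be correlated with a trace
        "layer": layer,
        "model": model,
        "rtf": round(real_time_factor(processing_seconds, audio_seconds), 4),
        "first_audio_delay": first_audio_delay,
        "billed_units": billed_units,  # e.g. characters for TTS, minutes for STT
        "estimated_cost": round(billed_units * unit_price, 6),
        "failure_type": failure_type,
    }
    return json.dumps(record, sort_keys=True)
```

The function has no parameter for the script or the audio itself, so personal data cannot reach the log by accident.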


7. Ethics, compliance, and data residency

Voice AI is the field where "technically possible" and "permissible" diverge the most. Treat the points below as design premises from the start.

  • Voice-cloning consent, disclosure, and provenance: cloning a voice without the speaker's consent leads directly to impersonation and fraud. Make a consent ledger, use limitation, expiry, disclosure of AI generation, and provenance tracking mandatory. → Voice-cloning governance design
  • AI-generation disclosure: state clearly that the voice is synthetic (for regulatory compliance and ethics).
  • Data residency: know which country and operator your scripts and audio are sent to. Keep confidential or sensitive personal data inside your boundary through region selection or OSS self-hosting.
  • Retention policy: decide the retention period, encryption, and deletion flow first.
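
The consent ledger with use limitation and expiry can be enforced as a gate in front of any cloning job. The following is a minimal sketch, not the governance design from the deep-dive article: `ConsentRecord`, `ConsentError`, and `check_cloning_allowed` are hypothetical names, and a real ledger would live in an audited datastore.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ConsentRecord:
    """One ledger entry: who consented, to which uses, until when."""
    speaker_id: str
    allowed_uses: FrozenSet[str]
    expires_on: date


class ConsentError(PermissionError):
    """Raised when a cloning request is not covered by recorded consent."""


def check_cloning_allowed(
    ledger: Dict[str, ConsentRecord], speaker_id: str, use: str, today: date
) -> ConsentRecord:
    """Return the consent record, or raise ConsentError if the request is not covered."""
    record = ledger.get(speaker_id)
    if record is None:
        raise ConsentError(f"no consent on file for {speaker_id}")
    if use not in record.allowed_uses:
        raise ConsentError(f"use '{use}' is outside the consented scope")
    if today > record.expires_on:
        raise ConsentError("consent has expired")
    return record
```

Calling this check before every synthesis, and logging the record it returns alongside the generated file, gives a basic provenance trail.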

8. Use-case-by-use-case recipes

  • Multilingual e-learning/narration: keep the timbre fixed and vary language_type across the translated scripts to produce all 10 languages (④). → Qwen-TTS
  • Multilingual video dubbing: ① source separation → ② transcription → translation → ④ TTS → ⑤ lip sync. This is the flow I built for the AI video-localization platform.
  • Automated phone IVR/reception dialogue: ②③④ in real time over WebSocket (→ Real-time, voice chatbot).
  • Accessible read-aloud: convert articles and documents to audio with ④ and serve them through an a11y player (→ Read-aloud UI).
  • Broadcasting/production QA: detect typos in on-screen captions (telops) by cross-checking OCR against ASR (an application of ②). → Telop typo detection
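
The first recipe, a fixed timbre with a varying language_type, can be sketched as a loop over translated scripts, with the language and voice checked against allow-lists at the boundary. Everything here is illustrative: `narrate_all` and the injected `synthesize` callable are hypothetical, and the allow-lists contain only sample values (of these, only "Japanese" and "Cherry" appear in this article), so take the real lists from the official documentation.

```python
from typing import Callable, Dict

# Assumption: sample allow-lists for illustration, not the full official lists.
ALLOWED_LANGUAGES = frozenset({"Japanese", "English", "Chinese"})
ALLOWED_VOICES = frozenset({"Cherry"})


def narrate_all(
    scripts: Dict[str, str],  # language_type -> script already translated into that language
    voice: str,
    synthesize: Callable[[str, str, str], bytes],  # (text, voice, language_type) -> audio
) -> Dict[str, bytes]:
    """Synthesize every translated script with the same voice."""
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"unknown voice: {voice}")
    unknown = [lang for lang in scripts if lang not in ALLOWED_LANGUAGES]
    if unknown:
        raise ValueError(f"unsupported language_type: {unknown}")
    # Validation happens before any call, so a bad entry never triggers billing.
    return {lang: synthesize(text, voice, lang) for lang, text in scripts.items()}
```

In practice `synthesize` would wrap the TTS call shown in section 2, ideally through the cached, retrying wrapper from the common layer.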

9. Summary: a voice-AI design cheat sheet

  • First isolate the layers: do you need only ②, only ④, ②③④ dialogue, or all?
  • STT: audio can't leave your environment, or is long → self-hosted Whisper. Need it running right away with high accuracy → gpt-4o-transcribe.
  • TTS: multilingual/dialect/cheapest → Qwen-TTS. Data sovereignty/unique voice → OSS self-hosting. See the comparison article for selection.
  • Dialogue: with first_audio_delay as the metric, stream to TTS sentence by sentence and stop instantly on barge-in.
  • Preprocessing: raise input quality with source separation and VAD (directly affects accuracy and cost).
  • Common equipment: idempotent caching, exponential backoff, 24h URL evacuation, observability, no PII emission, a11y.
  • Ethics: cloning presupposes consent, disclosure, and provenance. Make data residency an explicit design decision.

Voice AI looks like a one-line requirement, but the real work is designing the trade-offs among layer selection, cost, latency, privacy, and ethics. I have run both a voice concierge (STT→LLM→TTS) and multilingual dubbing in production, built to ensure idempotency, resilience, observability, a11y, and ethics. If you want to take your voice operations to production in a way that is fast, inexpensive, safe, and usable by anyone, I can support you end to end, from design through implementation and operations. Feel free to reach out at the requirements-gathering stage.



友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
