Category
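Voice AI (speech recognition STT / speech synthesis TTS / voice agents) implementation guide
Voice AI is a chain of 'ear (STT) → brain (LLM) → mouth (TTS)', and what decides success in production is less the intelligence of the model than the design of 'type-safe boundaries, resilience, cost, observability, and ethics'. This cluster covers speech recognition with Whisper; multilingual, dialect-capable, and cloning-capable speech synthesis with Qwen-TTS; real-time voice agents that shorten the delay to the first audio; TTS selection; and an accessible read-aloud UI. Across these topics it covers the design that makes voice earn its keep in production: content-hash idempotent caching, exponential backoff, backing up generated URLs, first_audio_delay observability, and consent and disclosure for voice cloning.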
9 articles in total
Foundational guide
Foundational guide (start here)
Voice-AI production-implementation guide [2026]: the big picture and tech selection of speech recognition (STT) × speech synthesis (TTS) × voice agents
A big-picture guide to putting voice AI (speech recognition STT, speech synthesis TTS, voice agents) into production. Serving as a map to each deep-dive article, it explains from a practitioner's view the tech selection for each layer, ear (Whisper) → brain (LLM) → mouth (Qwen-TTS), along with low-latency design for real-time dialogue, preprocessing (source separation, VAD), idempotent caching, resilience, observability, cost, a11y, and the ethics of voice cloning.
Related practical articles
- Next.js, Speech synthesis, Frontend, Qwen, a11y
Next.js × Qwen-TTS: implementing an accessible 'read-article-aloud' player at production quality (WCAG 2.2, type-safe, cache)
A guide to implementing an accessible audio player that reads articles and documents aloud with Next.js 16 and Qwen-TTS. With type-safe real code it explains server-side TTS generation (Zod validation, content-hash cache, hiding the key) and a WCAG 2.2-compliant React player (keyboard operation, aria-live, no autoplay, prefers-reduced-motion, focus management).
10 min read
- Python, Speech synthesis, Generative AI, Qwen, Architecture design
Qwen-TTS / Qwen3-TTS-Flash Production Guide: A Speech-Synthesis Design for Choosing Between the DashScope API and OSS Across 49 Timbres, 10 Languages, Chinese Dialects, and Voice Cloning
An implementation guide for using Qwen-TTS / Qwen3-TTS at production quality. Explained with real code, faithful to the official docs: the model lineup (qwen3-tts-flash / instruct-flash / realtime / qwen-tts), 49 timbres / 10 languages / 9 Chinese dialects, choosing between the DashScope API (Python / HTTP / streaming) and the OSS version (Apache-2.0 / 3-second voice cloning / voice design), and pricing, idempotency, resilience, observability, and ethics.
21 min read
- Python, Speech synthesis, Generative AI, Qwen, Real-time
Qwen-TTS real-time voice-agent implementation guide: WebSocket streaming, browser playback, and barge-in (interruption)
A guide to implementing, at production quality, a low-latency voice agent that 'replies while talking' with Qwen3-TTS-Flash-Realtime. With real code it explains the WebSocket bidirectional protocol (session.created / response.audio.delta / session.finished), when to use server_commit vs. commit, streaming synthesis of LLM output, gapless PCM 24kHz playback in the browser, barge-in (interruption), connection resilience, and measuring first_audio_delay.
10 min read
- Python, Speech synthesis, Generative AI, Qwen, Security
Qwen-TTS voice-cloning production guide: self-hosting the OSS version (Apache-2.0), and the governance design of consent, disclosure, and provenance
A guide to running, in production, voice cloning from 3 seconds of audio and voice design with the OSS version of Qwen3-TTS (Apache-2.0). With real code, it explains GPU self-hosting setup, a FastAPI inference server (type-safe, idempotent cache, GPU sharing), and most importantly the consent ledger, purpose limitation, AI-generation disclosure, provenance, and audit logs, which together form a design that structurally suppresses impersonation and fraud risk.
9 min read
- Speech synthesis, Generative AI, Qwen, Architecture design, Cost efficiency
TTS in-depth comparison 2026: choosing among Qwen-TTS / ElevenLabs / OpenAI / Google / Azure by cost, multilingual reach, self-hosting, voice cloning, and latency
A selection guide for speech-synthesis (TTS) APIs/models. It compares Qwen3-TTS-Flash, ElevenLabs Flash v2.5, OpenAI gpt-4o-mini-tts, Google Chirp 3 HD, and Azure Neural/Custom Neural Voice across six axes — pricing (per-character vs per-token), supported languages, self-hostability, voice cloning, latency, and data residency — grounded in official sources. It presents a decision framework that works backward from your requirements.
8 min read
- Python, Speech recognition, OpenAI API, Architecture design, Performance
OpenAI Whisper production-operation guide: a transcription design that chooses between self-hosting (large-v3-turbo) and the Audio API (gpt-4o-transcribe)
An implementation guide for using OpenAI Whisper at production quality. Faithful to the official documentation, it organizes the model list (large-v3 / turbo) and the Audio API (whisper-1 / gpt-4o-transcribe / gpt-4o-mini-transcribe), and explains in real code a self-host vs. API selection framework, circumventing the 25MB limit, SRT subtitle generation, prompt-guiding proper nouns, hallucination countermeasures, and idempotency, resumption, and observability.
15 min read
- AI, RAG, Voice AI, AWS Bedrock, Claude
Taking generative-AI voice customer service all the way to 'production': designing an unmanned kiosk with Bedrock × Whisper × Polly × pgvector
This article explains, in real code, the design for taking a generative-AI voice agent that replaces in-store face-to-face service all the way to production rather than stopping at a PoC. It covers the real-time voice loop, an asynchronous/parallel inference pipeline, RAG with pgvector, the structural elimination of hallucination, and an AWS production architecture.
13 min read
- RAG, Python, Architecture design, GCP, Performance
Automatically detecting telop typos in TV programs: OCR × speech recognition cross-check, Cloud Workflows parallelization, and hybrid-OCR cost optimization
An explanation of an ML pipeline that automatically detects typos in broadcast-program telops (subtitles), with real code as the single source of truth. It digs at the implementation level into hybrid OCR (detect telop changes with local OCR and apply LLM OCR only to the diffs), the OCR-vs-speech-recognition cross-check, parallelization with Cloud Workflows (about 30% shorter processing time), per-segment idempotent and resumable design, and monotonic progress with Firestore × SSE.
10 min read