A generative-AI voice chatbot
A RAG-powered generative-AI voice concierge that replaces and augments in-store service (a two-tier setup: a kiosk UI for visitors + a conversation-analytics console for operators)
Client
A generative-AI voice-concierge kiosk that replaces and augments in-store service (for retail stores handling specialized goods such as tires and wheels; a multi-tenant setup operating multiple stores and devices on a single foundation) | Form: in-store devices (kiosks) + a cloud operations console
My role
AI systems architect and full-stack engineer (solo across the voice-dialogue pipeline and RAG design, the React frontend, the Flask backend, IaC for the AWS infrastructure, and CI/CD).
Challenge (Situation & Task)
Service for goods requiring specialized knowledge had to work with limited staff and without quality differences between stores. Catalogs and specs are scattered across PDF, Excel, images, and video, and inquiries that only veteran staff can answer instantly arose daily. "Natural voice conversation" was a requirement, but processing transcription → generation → speech synthesis serially took several seconds or more — breaking the feel of in-person service.
This project posed four essential difficulties.
- "No wrong answers allowed" for specialized goods: getting a product number or size wrong (e.g. tire size 225-60-15) leads directly to mis-orders and complaints. The generative AI's fluency had to coexist with the strict accuracy of business data.
- Leveraging scattered unstructured data: a mechanism was essential to traverse large volumes of differently formatted documents (catalogs in PDF, spec sheets in Excel, product images, and explainer videos in mp4), searching by "meaning," not keywords, to ground answers.
- Voice latency that withstands in-person service: the heavy serial chain "transcription (STT) → input check → search → generation → speech synthesis (TTS)" had to be shortened to a speed that feels like talking to a person.
- Safety and an improvement loop for unattended operation: because you can't control what a visitor says at an unattended kiosk, filtering inappropriate utterances, session management, and an operations foundation letting operators see "which answers missed" and continuously improve were indispensable.
Why these technologies (Rationale)
Adopted AWS Bedrock (Claude 3.5 Sonnet) for the generative LLM: balances the naturalness of Japanese service with the control of keeping data inside AWS. The same model is shared via langchain_aws across three purposes: "answer generation," "input moderation," and "deciding whether to present media."
STT is OpenAI Whisper (whisper-1), TTS is AWS Polly (the "Takumi" voice / 125% read speed via SSML): tuned at the implementation level for Japanese recognition accuracy and a speaking rate that's easy to hear in service.
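As a rough illustration of the 125% read speed, the sketch below builds an SSML payload with a prosody rate and the keyword arguments one might pass to Polly's synthesize_speech. The function names, the mp3 output format, and the range check are assumptions for illustration, not the project's actual code.

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, rate_percent: int = 125) -> str:
    """Wrap text in SSML asking the TTS engine to speak at rate_percent of normal speed."""
    # Assumed guard: keep the rate within a conservative 20-200% window.
    if not 20 <= rate_percent <= 200:
        raise ValueError("rate_percent must be between 20 and 200")
    # Escape &, < and > so the visitor-facing text cannot break the SSML markup.
    return f'<speak><prosody rate="{rate_percent}%">{escape(text)}</prosody></speak>'


def polly_request(text: str) -> dict:
    """Keyword arguments for a Polly synthesize_speech call (the call itself is not made here)."""
    return {
        "Text": build_ssml(text),
        "TextType": "ssml",
        "VoiceId": "Takumi",
        "OutputFormat": "mp3",  # assumption: the source does not state the audio format
    }
```

With boto3, these arguments would be passed as `client.synthesize_speech(**polly_request(answer))`; the returned audio stream can then be Base64-encoded for the client.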
Embeddings use OpenAI text-embedding-3-large (1024 dimensions): prioritizing semantic-search accuracy across diverse product documents while capping dimensions at 1024 to balance search cost.
The vector store is consolidated on PostgreSQL + pgvector: rather than adding a dedicated vector DB, business data and embeddings run on a single RDB — judged advantageous for cost, operational load, and transactional consistency.
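A minimal sketch of what a top-k semantic search on pgvector could look like. The table name `documents`, its columns, and the `tenant_id` filter are hypothetical; only the `<=>` cosine-distance operator and the top-10 limit come from pgvector and the description here.

```python
from typing import Sequence


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding as a pgvector text literal, e.g. [0.5,1.0,-2.25]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical schema: documents(id, tenant_id, content, embedding vector(1024)).
# `<=>` is pgvector's cosine-distance operator, so 1 - distance is cosine similarity.
TOP_K_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM documents
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def build_search_params(embedding: Sequence[float], tenant_id: int, k: int = 10) -> dict:
    """Named parameters for TOP_K_SQL; executing it needs a PostgreSQL driver (not shown)."""
    return {"query": to_vector_literal(embedding), "tenant_id": tenant_id, "k": k}
```

Because the embeddings live in the same database as the business tables, a query like this can join against them or run inside the same transaction, which is the consolidation benefit described above.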
The heavy voice pipeline is made async with flask-executor background tasks + parallel fan-out + polling: avoids API Gateway's 29-second limit while running mutually independent steps (search, history fetch, QA-chain init) in parallel to shorten perceived latency.
Auth is a two-tier setup by purpose: the unattended kiosk uses a 6-digit access code → JWT (HttpOnly Cookie, CSRF token); the operations console uses AWS Cognito (SRP auth). The security boundary is separated between visitors and operators.
Infrastructure is fully IaC with Terraform: VPC / ECS Fargate / RDS (pgvector) / API Gateway / CloudFront / Cognito / Lambda are managed as reproducible code, running staging / production in the same configuration.
What I did (Action)
Implementing a natural voice-dialogue loop: the Web Audio API's AnalyserNode continuously analyzes volume, auto-stopping recording on silence (~3 seconds) and auto-ending the session after ~45 seconds. The avatar switches among 5 states — idle / listening / thinking / speaking / awaiting touch — based on the record→send→play state, conveying the dialogue status visually.
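The silence-based auto-stop can be sketched as a small timer over volume readings. The real logic runs in the browser on AnalyserNode data; this Python version only illustrates the rule, and the threshold value, the frame format, and the choice to ignore silence before the visitor first speaks are assumptions.

```python
from typing import Iterable, Optional, Tuple

SILENCE_SECONDS = 3.0  # from the description: stop after roughly 3 seconds of silence


def find_stop_time(
    frames: Iterable[Tuple[float, float]],
    threshold: float = 0.02,  # hypothetical volume threshold
    silence_seconds: float = SILENCE_SECONDS,
) -> Optional[float]:
    """Return the timestamp at which recording should stop, or None if it should continue.

    frames yields (timestamp_seconds, volume) pairs in chronological order.
    """
    last_voice: Optional[float] = None  # time of the most recent frame above threshold
    for timestamp, volume in frames:
        if volume >= threshold:
            last_voice = timestamp
        elif last_voice is not None and timestamp - last_voice >= silence_seconds:
            # The visitor spoke earlier and has now been quiet long enough.
            return timestamp
    return None
```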
Building an async inference pipeline: a voice POST immediately returns a taskId, and the backend runs the chain "Whisper transcription → input moderation by Claude → embedding generation → pgvector similarity search (top-10) → a LangChain QA chain (Claude 3.5 Sonnet, last 10 turns as context, 200-character limit) → Polly speech synthesis → Base64 return" as a background task on flask-executor. The client polls /api/task/{id} for the result.
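The submit-then-poll pattern described here can be sketched with the standard library alone. This is an illustration under assumptions: `ThreadPoolExecutor` stands in for flask-executor, an in-memory dict stands in for persisted task results, and `run_pipeline`, `submit_voice`, and `poll_task` are hypothetical names.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_tasks: dict[str, Future] = {}  # stand-in for a persisted task-result store


def run_pipeline(audio: bytes) -> dict:
    """Placeholder for the heavy chain (transcription, search, generation, synthesis)."""
    time.sleep(0.05)
    return {"answer": f"processed {len(audio)} bytes"}


def submit_voice(audio: bytes) -> str:
    """What the voice POST handler would do: start the work and return a task id at once."""
    task_id = uuid.uuid4().hex
    _tasks[task_id] = _executor.submit(run_pipeline, audio)
    return task_id


def poll_task(task_id: str) -> dict:
    """What the polling endpoint would return for a given task id."""
    future = _tasks.get(task_id)
    if future is None:
        return {"status": "not_found"}
    if not future.done():
        return {"status": "pending"}
    if future.exception() is not None:
        return {"status": "failed"}
    return {"status": "done", "result": future.result()}
```

Since each HTTP request returns quickly, no single request comes near the API Gateway timeout, however long the pipeline itself takes.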
Structurally eliminating hallucinations: product numbers and sizes — where a mix-up is unacceptable — are not left to the LLM but extracted from speech by regex and deterministically matched against a normalized_data table, removing the generative AI's ambiguity from business-critical points.
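A minimal sketch of the regex-plus-master-lookup idea for tire sizes. The pattern, the canonical "225/60R15" key format, and the in-memory `NORMALIZED_DATA` dict (standing in for the normalized_data table) are assumptions for illustration; the real extraction rules are not given in this write-up.

```python
import re
from typing import Optional

# Width (3 digits), aspect ratio (2 digits), rim diameter (2 digits),
# separated by "-", "/", whitespace or "R", e.g. 225-60-15 or 225/60R15.
TIRE_SIZE = re.compile(
    r"(?<!\d)(\d{3})\s*[-/\s]\s*(\d{2})\s*[-/\sR]+\s*(\d{2})(?!\d)",
    re.IGNORECASE,
)

# Hypothetical master data keyed by the canonical size string.
NORMALIZED_DATA = {
    "225/60R15": {"sku": "DEMO-001", "name": "Example tire"},
}


def extract_tire_size(utterance: str) -> Optional[str]:
    """Return the first tire size in the utterance in canonical form, or None."""
    match = TIRE_SIZE.search(utterance)
    if match is None:
        return None
    width, aspect, rim = match.groups()
    return f"{width}/{aspect}R{rim}"


def lookup_product(utterance: str) -> Optional[dict]:
    """Deterministic match: no LLM involved, and unknown sizes return None."""
    key = extract_tire_size(utterance)
    if key is None:
        return None
    return NORMALIZED_DATA.get(key)
```

When the lookup returns None, the caller can ask the visitor to repeat the size rather than let the LLM guess a product number.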
Input moderation: a visitor's utterance is classified ACCEPT/REJECT by Claude 3.5 Sonnet (temperature 0.1), blocking inappropriate or irrelevant utterances before answer generation to ensure the safety of unattended operation.
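One way to wire an ACCEPT/REJECT gate is sketched below. The prompt wording, the `classify` callable (standing in for the Claude call), and the fail-closed handling of errors or unexpected replies are assumptions, not details confirmed by the project.

```python
from typing import Callable

# Hypothetical prompt; the project's actual moderation prompt is not shown here.
MODERATION_PROMPT = (
    "You are screening utterances at an in-store kiosk. "
    "Reply with exactly ACCEPT or REJECT.\n"
    "Utterance: {utterance}"
)


def is_accepted(utterance: str, classify: Callable[[str], str]) -> bool:
    """Return True only when the classifier's reply is exactly ACCEPT.

    Any error or unexpected reply is treated as a rejection (fail closed).
    """
    try:
        verdict = classify(MODERATION_PROMPT.format(utterance=utterance))
    except Exception:
        return False
    return verdict.strip().upper() == "ACCEPT"
```

Passing the model call in as a function keeps the gate testable without network access, and the strict equality check keeps a chatty or malformed reply from being read as approval.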
Multimodal presentation: Claude decides "whether to show an image or video," and the relevant media file (stored in S3 and served via signed URLs) is attached to the answer, with videos played via video.js. This visually supplements product explanations that voice alone can't convey.
An operations foundation for a continuous-improvement loop: the operations console lets operators search and review all conversations and annotate the failure reason and the "ideal answer" as training data. Re-embedding and re-injecting this builds a mechanism that improves accuracy while in operation. Per-conversation process_time is also recorded to ensure observability.
IaC and CI/CD for the AWS infrastructure: Terraform builds VPC through ECS Fargate, RDS (pgvector), API Gateway (VPC Link → NLB → ALB), CloudFront, Cognito, and 3 Lambdas. GitHub Actions automates the backend (Docker build → ECR → ECS rolling deploy) and the frontend (S3 sync → CloudFront invalidation).
Latency design was the single biggest piece of engineering. Voice service has the essential constraint that "processing serially breaks down." So processing was split into two stages. First, at the HTTP layer, a voice upload immediately returns the transcription result and a taskId, with heavy generation made a background task (also avoiding API Gateway's 29-second timeout). Second, inside the pipeline, mutually independent steps (session fetch, conversation-history fetch, vector search, QA-chain init) are fanned out in parallel with flask-executor's submit and concurrent.futures' as_completed to shorten the critical path.
Accuracy is ensured by a "generation + rules" hybrid. Fluent explanatory text is left to Claude 3.5 Sonnet, while product numbers — where errors are fatal — fall back to deterministic processing: regex extraction + master matching. Matching the RAG search basis (similarity stored in vector_search_results) against conversation logs makes it possible to trace, after the fact, why a given answer was reached.
The data model expresses the full conversation context in 11 tables (a User→Terminal→Session→Conversation→Chat hierarchy connected to RAG, Document (pgvector), Attachment, NormalizedData, and BackgroundTaskResult). Even in unattended operation, this structure supports a per-conversation cycle of quality evaluation and improvement.
Key technical decisions
AWS Bedrock (Claude 3.5 Sonnet): balancing the naturalness of Japanese service with data control
pgvector + OpenAI Embeddings (text-embedding-3-large / 1024 dims): semantic-search RAG consolidated on an RDB
Async, parallel pipeline + polling via flask-executor: shortening voice latency
Two-tier auth (kiosk = access code + JWT / operator = Cognito): security separation by purpose
Responsibilities
- Requirements & voice-dialogue / RAG architecture design
- Frontend development (React / TypeScript / Chakra UI)
- Backend development (Python / Flask / LangChain)
- AI pipeline implementation (Bedrock / Whisper / Polly / pgvector)
- Infrastructure build (AWS / Terraform / ECS Fargate / API Gateway)
- Security & auth design (Cognito / JWT) and CI/CD
Results in numbers
- Response speed: 5 s → 1.5 s (~70% shorter) via the async, parallel pipeline.
- Answer accuracy: 90%+, reached a practical level via RAG + part-number master matching (not stopping at PoC).
- Store staff's service burden: reduced by 50% via AI service.
Results
- Implemented generative-AI voice service — which often stalls at PoC — to a level that runs in production as an unattended kiosk
- Made the heavy chain "transcription → input moderation → RAG search → generation → speech synthesis" async and parallel, achieving a response speed that withstands in-person service (5 seconds → ~1.5 seconds)
- A hybrid design that deterministically matches product numbers/sizes structurally eliminates fatal wrong answers for specialized goods
- Cross-leveraged product information scattered across PDF, Excel, images, and video via pgvector semantic search, raising answer accuracy to a practical level
- Established a continuous-improvement loop that raises accuracy while in operation, via conversation review + training-data annotation in the operations console
- Balanced the safety of unattended operation with log auditability via two-tier auth — visitors (access code + JWT) and operators (Cognito)
- Maintained staging / production reproducibly and with low operational load via full IaC with Terraform and automated GitHub Actions deploys
- Concentrated siloed specialized knowledge into the AI, easing store staff's service burden