I am getting more and more inquiries along the lines of "I want to automate in-store customer service with generative AI." A demo can now be built over a weekend: capture the microphone with getUserMedia, transcribe with Whisper, pass the text to an LLM, and read the reply aloud with TTS. That is all it takes.
But placing it in an unstaffed storefront, where you have no idea what visitors will say, and keeping it running every day without breaking is a completely different story. This article records, with real code, how I bridged "the distance between a demo and production," based on designing, implementing, and shipping a generative-AI voice customer-service kiosk for a retail store that handles specialized merchandise (products such as tires, where the product number and size are everything).
The target readers are engineers and tech leads who want to take a generative-AI product into real business use rather than stopping at proof of concept. The stack is Python (Flask) / React (TypeScript) / AWS (Bedrock, ECS Fargate, API Gateway, RDS+pgvector, Cognito, Lambda) / Terraform. Details of the project itself are left to the case study at the end; here I concentrate on the design decisions and trade-offs.
The five walls that separate a demo from production
Running voice customer service in production, I inevitably hit the following five walls.
| Wall | Ignorable in a demo | Fatal in production |
|---|---|---|
| Latency | Waiting 5 seconds is fine | 5 seconds of silence breaks a face-to-face conversation |
| Accuracy | Plausible is good enough | A wrong product number or size leads directly to a wrong order or a complaint |
| Safety | Only the developer speaks | Visitors say anything (off-topic or inappropriate utterances) |
| Improvement | Fix by hand | Need a mechanism to collect, during operation, which answers missed |
| Operation | Working locally is enough | Zero-downtime deploys, reproducible infrastructure, audit logs |
Below, I take these five down one by one.
Overall architecture: separate the two surfaces
The first design decision was to completely separate "the surface visitors touch" from "the surface operators touch," because both the requirements and the security boundaries are entirely different.
- Storefront kiosk (for visitors): converses by voice. Authentication is a short access code tied to the store terminal. Thoroughly optimized for "the speed of conversation."
- Operation console (for operators): conversation-log analysis, management of the RAG knowledge base, terminal management. Authentication is AWS Cognito. Thoroughly optimized for "accuracy and auditability."
The request flow is as follows. For the operation console, API Gateway's Cognito authorizer verifies the JWT, and the request is then forwarded to the internal ECS service via VPC Link.
[customer] ── audio ──▶ CloudFront ──▶ ALB ──▶ ECS(Flask) ──▶ RDS(pgvector) / S3
                                                 │
                                                 ├─▶ Bedrock (Claude 3.5 Sonnet)
                                                 ├─▶ OpenAI (Whisper / Embeddings)
                                                 └─▶ Polly (TTS)
[operator] ── JWT ──▶ API Gateway ──(Cognito Authorizer)──▶ VPC Link ──▶ NLB ──▶ ALB ──▶ ECS
Why API Gateway sits only in front of the operation console is covered later; in short, it lets the managed service handle Cognito authentication and keeps the internal ECS service off the public internet.
Wall ① latency: serial processing definitely breaks down
A voice-customer-service pipeline is essentially heavy serial processing.
record → STT(transcription) → input check → vector search → LLM generation → TTS(speech synthesis) → return
Written naively as a serial chain, each step's several hundred milliseconds to several seconds add up and easily exceed 5 seconds. In face-to-face service, 5 seconds of silence is long enough for the visitor to conclude "it froze." I attacked this in two stages.
Stage 1: return HTTP immediately, push generation to the background
POST /api/conversation, which uploads the audio, immediately returns only the transcription result and a taskId, and pushes the heavy generation work to a background task. The client then polls for the result using the taskId.
This design has two practical benefits. The first is perceived latency: the user immediately gets back "what was heard." The second is that it structurally sidesteps API Gateway's 29-second timeout. Even on a day when LLM generation is slow, the HTTP layer doesn't clog.
In Flask, I implemented this with flask-executor (a thread pool).
from concurrent.futures import as_completed

import openai
from flask import current_app, request
from flask_jwt_extended import jwt_required
from flask_restful import Resource  # assumption: could equally be flask_restx


class ConversationResource(Resource):
    @jwt_required()
    def post(self):
        executor = current_app.config["EXECUTOR_CLIENT"]
        audio_file = self._decode_audio(request)

        # Transcription and the access-code lookup are independent, so run them in parallel.
        futures = {
            executor.submit(
                openai.audio.transcriptions.create,
                model="whisper-1",
                file=audio_file,
                language="ja",
            ): "transcription",
            executor.submit(self.get_access_code): "access_code",
        }
        results = {}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
        transcript = results["transcription"].text

        # Create the task row first so the client has an ID to poll.
        task = BackgroundTaskResult(encoded_audio=None)
        db.session.add(task)
        db.session.commit()

        # Push the heavy generation to the background; HTTP returns right here.
        executor.submit(self.background_task, transcript, results["access_code"], task.id)
        return {"transcript": transcript, "taskId": task.id}, 202
The polling side distinguishes "still processing (202)" and "completed (200 + audio)."
class BackgroundTaskResultResource(Resource):
    def get(self, task_id):
        task = BackgroundTaskResult.query.get_or_404(task_id)
        if task.encoded_audio is None:
            return {"status": "in_progress"}, 202
        return {"status": "completed", "audio": task.encoded_audio}, 200
Note: I chose a thread pool because this pipeline is dominated by I/O-bound work (waiting on external APIs) and Gunicorn runs with eventlet workers. If CPU-bound preprocessing grows, the next step is to move to SQS plus a dedicated worker. What matters is keeping the order: first make it work, then move it to the right place once the bottleneck becomes visible.
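The client half of this submit-and-poll contract can be sketched as follows. The real kiosk client is the React app; this is only a minimal Python illustration of the loop. The names `poll_for_audio` and `fetch`, the polling interval, and the timeout are assumptions, not taken from the production code.

```python
import time
from typing import Callable, Optional

# Hypothetical: a function that performs the GET for one taskId and
# returns (http_status, json_body). The actual URL is not shown in the article.
Fetch = Callable[[], tuple[int, dict]]


def poll_for_audio(
    fetch: Fetch,
    interval_s: float = 0.3,
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the task completes (200) or the deadline passes.

    Returns the base64 audio, or None on timeout so the caller can play
    a fallback message instead of staying silent.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        status, body = fetch()
        if status == 200:
            return body["audio"]
        if status != 202:
            raise RuntimeError(f"unexpected status {status}")
        sleep(interval_s)
    return None


# Usage with a fake endpoint: two "in progress" responses, then the result.
responses = iter([
    (202, {"status": "in_progress"}),
    (202, {"status": "in_progress"}),
    (200, {"status": "completed", "audio": "QUJD"}),
])
audio = poll_for_audio(lambda: next(responses), sleep=lambda s: None)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a production client would likely also add backoff.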
Stage 2: parallel fan-out inside the pipeline
Inside the background task as well, steps with no dependency on each other run in parallel. For example, "fetching session info," "input moderation," and "embedding generation" are mutually independent, so they run simultaneously to shorten the critical path.
from concurrent.futures import as_completed


def background_task(self, transcript, access_code, task_id):
    executor = current_app.config["EXECUTOR_CLIENT"]

    # Phase 1: mutually independent steps, so fan out.
    futures = {
        executor.submit(self.get_session, access_code): "session",
        executor.submit(self.filter_question_content, transcript): "moderation",
        executor.submit(self.get_embedding, transcript): "embedding",
    }
    phase1 = {futures[f]: f.result() for f in as_completed(futures)}
    if phase1["moderation"] == "REJECT":
        return self._store_rejection(task_id)  # never proceed to generation

    # Phase 2: retrieval, history lookup, and QA-chain setup are also independent, so fan out.
    futures = {
        executor.submit(self.retrieve_documents, phase1["embedding"]): "docs",
        executor.submit(self.get_past_conversations, phase1["session"].id): "history",
        executor.submit(load_qa_chain, self.claude_llm): "qa_chain",
    }
    phase2 = {futures[f]: f.result() for f in as_completed(futures)}

    answer = self._generate(phase2["qa_chain"], phase2["docs"], phase2["history"], transcript)
    encoded = self.encode_audio(answer)  # Polly -> base64
    self._persist(task_id, encoded, answer)
The point isn't to "make everything asynchronous" but to draw the dependency graph and run only the steps with no dependency between them concurrently. Blind parallelization is a breeding ground for bugs.
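To make the effect on the critical path concrete, here is a small self-contained sketch that fans out three independent I/O-like calls and measures the elapsed time. It uses the standard-library ThreadPoolExecutor rather than flask-executor, and `run_parallel`, `fake_io`, and the 0.1-second delays are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(executor, tasks):
    """Submit independent callables together and collect results by name."""
    futures = {name: executor.submit(fn) for name, fn in tasks.items()}
    return {name: future.result() for name, future in futures.items()}


def fake_io(label, seconds):
    """Stand-in for an external API call: sleeps instead of calling out."""
    def call():
        time.sleep(seconds)
        return label
    return call


with ThreadPoolExecutor(max_workers=3) as pool:
    started = time.perf_counter()
    phase1 = run_parallel(pool, {
        "session": fake_io("session-row", 0.1),
        "moderation": fake_io("ACCEPT", 0.1),
        "embedding": fake_io("vector", 0.1),
    })
    elapsed = time.perf_counter() - started

# Run serially, the three calls would take about 0.3 s in total;
# fanned out, the phase costs roughly as long as its slowest member.
print(f"phase 1 took {elapsed:.2f}s")
```

The saving only applies to steps with no edge between them; a step that consumes another step's output still has to wait for it, which is why the pipeline is split into phases.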
Wall ② accuracy: eliminate wrong answers with a hybrid of generation and rules
The scariest thing with specialized merchandise is a "fluent wrong answer": the LLM confidently states a wrong product number. If it says the tire size 225/60R15 as 225/65R15, that leads directly to a wrong order.
The design principle here is clear: "Where ambiguity is acceptable, lean on generation; where an error is fatal, lean on rules."
Don't leave the extraction of the product number or size to the LLM. Extract it deterministically from the utterance text with a regex and match it against the master table (normalized_data).
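import re

# width / aspect ratio / rim diameter, e.g. "225/60R15" or "225 60 15".
# The separator may be a hyphen or a slash (ASCII or full-width, since
# Japanese transcripts can contain either form).
TIRE_SIZE = re.compile(r"(\d{3})\s*[-/／]?\s*(\d{2})\s*[-/Rr]?\s*(\d{2})")


def resolve_product_number(self, transcript: str) -> str | None:
    m = TIRE_SIZE.search(transcript)
    if not m:
        return None
    w, aspect, rim = m.groups()
    # Look the extracted digits up in the product-number master.
    row = NormalizedData.query.filter_by(
        digit_1=w, digit_2=aspect, digit_3=rim
    ).first()
    return row.product_number if row else None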
If a match is found, that fixed value is passed to the LLM as a constraint on the answer. The labor is divided: generative AI produces "the words of the explanation," and the master table holds "the truth of the numbers." This alone makes complaint-grade wrong answers almost disappear.
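One way to extend the same idea, sketched here under assumptions, is to run the deterministic pattern over the generated answer as well and refuse to speak an answer that mentions a size other than the confirmed one. The helpers `normalize_size` and `answer_is_consistent` are hypothetical and not part of the production code; the pattern follows the shape of the TIRE_SIZE regex above, restricted to ASCII separators for brevity.

```python
import re
from typing import Optional

# Same shape as the article's pattern: width / aspect ratio / rim diameter.
TIRE_SIZE = re.compile(r"(\d{3})\s*[-/]?\s*(\d{2})\s*[-/Rr]?\s*(\d{2})")


def normalize_size(text: str) -> Optional[str]:
    """Return the first tire size found, in canonical form such as '225/60R15'."""
    m = TIRE_SIZE.search(text)
    if not m:
        return None
    width, aspect, rim = m.groups()
    return f"{width}/{aspect}R{rim}"


def answer_is_consistent(answer: str, confirmed: str) -> bool:
    """True if every size mentioned in the answer equals the confirmed one."""
    mentioned = {f"{w}/{a}R{r}" for w, a, r in TIRE_SIZE.findall(answer)}
    return mentioned <= {confirmed}


confirmed = normalize_size("225 60 15 in stock?")
ok = answer_is_consistent("Yes, 225/60R15 is in stock.", confirmed)
bad = answer_is_consistent("Yes, 225/65R15 is in stock.", confirmed)
```

Because both helpers are pure functions, they can be unit-tested without the database or any external API. Note that a pattern this loose can also match other seven-digit runs, so a real guard would need to exclude things like phone numbers.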
The LLM-side generation isn't left unconstrained either. The top retrieved documents and the recent conversation history are passed to LangChain's QA chain, and the answer length is capped: because the answer is read aloud, an overly long one fails as customer service.
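# The prompt stays in Japanese because the kiosk speaks Japanese. In English:
# "You are a store customer-service clerk. Using only the reference information
#  and conversation history below, answer within 200 characters and do not
#  assert anything you cannot confirm."
# Labels: confirmed product number (or "unconfirmed") / recent conversation / question.
prompt = (
    "あなたは店舗の接客スタッフです。以下の参考情報と会話履歴だけを根拠に、"
    "200文字以内で、確認できない事項は断定せず案内してください。\n"
    f"# 確定した商品番号: {product_number or '未確定'}\n"
    f"# 直近の会話:\n{recent_history}\n"
    f"# 質問: {transcript}"
)
answer = qa_chain.run({"input_documents": documents, "question": prompt})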
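A length instruction in a prompt is a request, not a guarantee, so a small post-generation guard can be useful before the text goes to TTS. The following is a minimal sketch; `clamp_for_speech` and its choice of sentence terminators are assumptions rather than the production implementation.

```python
def clamp_for_speech(answer: str, limit: int = 200) -> str:
    """Trim an answer to at most `limit` characters, preferring a sentence end."""
    text = answer.strip()
    if len(text) <= limit:
        return text
    head = text[:limit]
    # Cut after the last sentence terminator inside the limit, if there is one.
    cut = max(head.rfind(ch) for ch in "。.!?！？")
    return head[: cut + 1] if cut >= 0 else head


short = clamp_for_speech("在庫がございます。")
long_answer = "あ" * 150 + "。" + "い" * 100
trimmed = clamp_for_speech(long_answer)
```

Cutting at a sentence boundary keeps the spoken reply from ending mid-phrase; when no terminator exists inside the limit, the sketch falls back to a hard cut.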
Wall ③ safety: gate the utterance before generation
An unstaffed kiosk can't control "who says what." Off-topic chatter, inappropriate utterances, and adversarial prompts all arrive. So I placed a gate that classifies and blocks the input before generation. Claude 3.5 Sonnet is used as a low-temperature (temperature=0.1) classifier that returns only ACCEPT or REJECT.
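import json


def filter_question_content(self, transcript: str) -> str:
    # The instruction stays in Japanese. In English: "Reply ACCEPT if the
    # following utterance is appropriate for in-store service; reply only REJECT
    # if it is adult, political, defamatory, or unrelated to the merchandise."
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "temperature": 0.1,
        "messages": [{
            "role": "user",
            "content": (
                "次の発話が店舗接客として適切なら ACCEPT、"
                "成人向け・政治・誹謗中傷・商材と無関係なら REJECT のみを返答:\n"
                f"{transcript}"
            ),
        }],
    })
    res = self.bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=body,
    )
    verdict = json.loads(res["body"].read())["content"][0]["text"].strip()
    # Anything that does not contain REJECT is treated as ACCEPT.
    return "REJECT" if "REJECT" in verdict else "ACCEPT"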
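The function above accepts any output that doesn't contain REJECT. A stricter, fail-closed variant is sketched below as one possible tightening, not as what the production system does. `parse_verdict`, `moderate`, and the injected `classify` callable are hypothetical names.

```python
from typing import Callable


def parse_verdict(raw: str) -> str:
    """Map raw classifier output to ACCEPT or REJECT, rejecting anything unexpected."""
    verdict = raw.strip().upper()
    return "ACCEPT" if verdict == "ACCEPT" else "REJECT"


def moderate(classify: Callable[[str], str], transcript: str) -> str:
    """Run the classifier and fail closed if it raises."""
    try:
        return parse_verdict(classify(transcript))
    except Exception:
        # An unavailable classifier blocks generation instead of waving input through.
        return "REJECT"


accepted = moderate(lambda t: " accept\n", "Do you have winter tires?")
chatty = moderate(lambda t: "ACCEPT, although...", "some utterance")
```

The trade-off is more false rejections when the model adds extra words, so this choice depends on whether a missed block or a wrongly refused customer is the more expensive error for the store.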
Sharing the same Bedrock Claude model across three purposes ("answer generation," "input moderation," and "deciding whether to show an image or video") is also a deliberate cost and operations choice. No additional provider is needed, and model updates happen in one place.
RAG foundation: why pgvector, and the limit of a full scan
For the foundation of knowledge search (RAG), I chose PostgreSQL + pgvector without adding a dedicated vector DB. There are three reasons.
- Simplicity of operation: the business data and the embeddings ride on the same RDB. Backups, permissions, and transactions are unified.
- Cost: no extra monthly fee or separate monitoring for a dedicated service such as Pinecone.
- Consistency: conversations, documents, and search results (similarity scores) can be handled within the same transaction boundary.
For embeddings, I used OpenAI text-embedding-3-large truncated to 1024 dimensions. It's a balance point that secures search accuracy while holding down storage/compute cost.
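The text-embedding-3 models accept a dimensions parameter that returns shortened vectors directly. If a full-length vector is truncated by hand instead, it should be re-normalized to unit length, otherwise similarity scores stop being comparable. A minimal pure-Python sketch of that step follows; the helper name `truncate_embedding` is an assumption, and NumPy would be the natural vectorized equivalent.

```python
import math
from typing import Sequence


def truncate_embedding(vec: Sequence[float], dims: int = 1024) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Tiny worked example: truncating [3, 4, 12] to two dimensions.
shortened = truncate_embedding([3.0, 4.0, 12.0], dims=2)
```

In the example, the kept components [3, 4] have length 5, so the result is [0.6, 0.8].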
Let me be honest here. The initial implementation's search was a naive method that reads every document and computes the inner product on the Python side.
import numpy as np


def retrieve_documents(self, query_embedding):
    cur = self.conn.cursor()
    # Full scan: every stored vector is pulled into Python and scored.
    cur.execute("SELECT id, content, vector FROM documents")
    scored = [
        (np.dot(query_embedding, vec), content, doc_id)
        for doc_id, content, vec in cur.fetchall()
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    top = scored[:10]
    # Return the LangChain documents plus the raw (score, content, id) tuples for logging.
    return [LangchainDocument(page_content=c) for _, c, _ in top], top
At the scale of a single store's catalog, this works well enough in practice. It was a deliberate choice to avoid premature optimization and first make it work correctly. But it is an O(n) full scan, and it breaks down once documents exceed tens of thousands. The right migration target at scale is pgvector's indexes (HNSW / IVFFlat) and distance operators.
-- Migration target at scale: let the database perform the ANN search
CREATE INDEX ON documents USING hnsw (vector vector_cosine_ops);

SELECT id, content
FROM documents
ORDER BY vector <=> %(query)s  -- cosine distance; the index is used
LIMIT 10;
"Make it work correctly at the current scale, and state the limit and the path beyond it clearly": this, I believe, is RAG operation without bluffing. The search results are saved along with their similarity scores in vector_search_results, so that you can later trace "why that answer came out." This pays off in the "improvement loop" described next.
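One detail worth checking during such a migration: the Python path ranks by raw inner product, while the `<=>` operator ranks by cosine distance. The two orderings agree only when the stored vectors are unit length. The sketch below shows the difference on toy data; `cosine_top_k` and the sample vectors are illustrative assumptions.

```python
import heapq
import math


def cosine_top_k(query, docs, k=10):
    """docs is an iterable of (doc_id, vector); returns [(score, doc_id)], best first."""
    q_norm = math.sqrt(sum(x * x for x in query))

    def score(vec):
        dot = sum(a * b for a, b in zip(query, vec))
        v_norm = math.sqrt(sum(x * x for x in vec))
        if q_norm == 0.0 or v_norm == 0.0:
            return 0.0  # treat zero vectors as unrelated
        return dot / (q_norm * v_norm)

    return heapq.nlargest(k, ((score(vec), doc_id) for doc_id, vec in docs))


query = [1.0, 0.0]
docs = [("a", [1.0, 0.0]), ("b", [5.0, 5.0])]

by_cosine = [doc_id for _, doc_id in cosine_top_k(query, docs, k=2)]
by_inner = sorted(docs, key=lambda d: -sum(x * y for x, y in zip(query, d[1])))
# Cosine ranks "a" first; the raw inner product ranks the longer vector "b" first.
```

`heapq.nlargest` also avoids sorting the whole list when only the top few results are needed, which is a cheap improvement while the data still fits the in-process approach.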
Wall ④ improvement: a loop that raises accuracy while operating
The biggest difference between a PoC and production is "whether there is a mechanism to collect and fix the answers that missed." The data model is designed to keep the full context of each conversation (overview below).
User ─< Terminal ─< Session ─< Conversation ─< Chat
                                   │
                                   ├─ vector_search_results (search basis + similarity)
                                   └─ documents (pgvector embeddings) ─< attachments(image/video)
normalized_data (product-number master)
background_task_results (async handle for generated audio)
Conversation holds failure_reason (a code such as speech-recognition failure or moderation rejection) and process_time (the measured latency). It also holds two fields that an operator fills in afterward as supervision data: failure_reason_should_be and system_response_message_should_be.
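As a rough illustration of how a turn can always end in exactly one record carrying these fields, consider the sketch below. The enum values, `ConversationRecord`, `run_turn`, and the fallback sentence are assumptions made for the example; the article names only speech-recognition failure and moderation rejection as codes.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureReason(str, Enum):
    # Hypothetical codes for illustration.
    STT_FAILED = "stt_failed"
    MODERATION_REJECTED = "moderation_rejected"
    GENERATION_FAILED = "generation_failed"


@dataclass
class ConversationRecord:
    answer: str
    failure_reason: Optional[FailureReason]
    process_time: float  # seconds, measured around the whole turn


FALLBACK_ANSWER = "Sorry, could you say that once more?"


def run_turn(generate: Callable[[], str]) -> ConversationRecord:
    """Run one generation step and converge success or failure into one record."""
    started = time.perf_counter()
    try:
        answer, reason = generate(), None
    except Exception:
        # Never let the exception escape: the visitor still gets something to hear.
        answer, reason = FALLBACK_ANSWER, FailureReason.GENERATION_FAILED
    return ConversationRecord(answer, reason, time.perf_counter() - started)


def broken_generate() -> str:
    raise TimeoutError("LLM call timed out")


good = run_turn(lambda: "225/60R15 is in stock.")
failed = run_turn(broken_generate)
```

Recording the failure as data rather than as a stack trace is what lets the operation console later list and filter the misses.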
When an operator reviews a conversation in the operation console and enters "this is what the answer should have been," the correction can be re-embedded and fed back into the knowledge base. In other words, operation itself becomes a loop that produces training data. Accumulating process_time also makes latency regressions detectable.
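A latency regression check over accumulated process_time values could look like the following sketch. The p95 metric, the 1.2x tolerance, and the function names are assumptions chosen for illustration, not the production alerting rule.

```python
from statistics import quantiles
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile: the last of the 19 cut points that split the data into 20 groups."""
    return quantiles(samples, n=20)[-1]


def latency_regressed(
    baseline: Sequence[float], recent: Sequence[float], tolerance: float = 1.2
) -> bool:
    """True if the recent p95 exceeds the baseline p95 by more than `tolerance` times."""
    return p95(recent) > p95(baseline) * tolerance


# Synthetic process_time samples in seconds.
baseline = [1.0 + 0.01 * i for i in range(100)]
slower = [t * 1.5 for t in baseline]

regressed = latency_regressed(baseline, slower)
steady = latency_regressed(baseline, baseline)
```

A tail percentile is used instead of the mean because, in face-to-face service, the slow turns are the ones visitors notice.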
Wall ⑤ operation: two-layer authentication and reproducible infrastructure
Separate authentication by purpose
Handling visitors and operators with the same authentication is poor design. So I separated it into two layers.
- Kiosk: verify a 6-digit access code tied to the store terminal, and issue a JWT in an HttpOnly cookie (access 24h / refresh 7d, with a CSRF token). Visitors don't perform any login operation.
- Operation console: AWS Cognito (SRP authentication). Verify the JWT with API Gateway's Cognito authorizer, then pass it inside.
# Kiosk: access code -> JWT (cookie)
from datetime import timedelta

from flask import make_response, request
from flask_jwt_extended import create_access_token, set_access_cookies


class AccessCodeResource(Resource):
    def post(self):
        # Validate the request body at the boundary before touching the database.
        code = AccessCodeSchema().load(request.get_json())["accessCode"]
        terminal = Terminal.query.filter_by(access_code=code, is_active=True).first()
        if terminal is None:
            return {"message": "invalid code"}, 401
        access = create_access_token(identity={"token": code},
                                     expires_delta=timedelta(hours=24))
        resp = make_response({"ok": True})
        set_access_cookies(resp, access, max_age=24 * 60 * 60)  # HttpOnly + CSRF
        return resp
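For reference, cookie-based JWT handling in Flask-JWT-Extended is driven by application config. The fragment below is a sketch of settings that would match the lifetimes described above. The key names are the library's options, but the SameSite and Secure values and the name `KIOSK_JWT_CONFIG` are assumptions, not taken from the production config.

```python
from datetime import timedelta

# Lifetimes mirror the article (access 24h / refresh 7d).
# JWT_COOKIE_SECURE and JWT_COOKIE_SAMESITE are assumed values.
KIOSK_JWT_CONFIG = {
    "JWT_TOKEN_LOCATION": ["cookies"],       # read tokens from cookies, not headers
    "JWT_COOKIE_CSRF_PROTECT": True,         # require the CSRF double-submit token
    "JWT_COOKIE_SECURE": True,               # send cookies over HTTPS only
    "JWT_COOKIE_SAMESITE": "Strict",
    "JWT_ACCESS_TOKEN_EXPIRES": timedelta(hours=24),
    "JWT_REFRESH_TOKEN_EXPIRES": timedelta(days=7),
}

# Applied once at startup, for example: app.config.update(KIOSK_JWT_CONFIG)
```

Keeping the token in a cookie that scripts cannot read, with a separate CSRF token, means the kiosk frontend never handles the JWT directly.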
Input is validated at the boundary with Marshmallow schemas, and DB access goes through SQLAlchemy's ORM (parameterized queries). Catalog images and videos on S3 are served only through signed URLs, and the bucket is never made public.
Infrastructure fully coded with Terraform
So that staging and production can be reproduced from the same configuration, I managed everything from the VPC to the app with Terraform. The backend runs on ECS Fargate (lightweight tasks, with zero downtime through rolling deploys).
resource "aws_ecs_service" "backend" {
  name            = "${var.env}-backend"
  cluster         = aws_ecs_cluster.backend.id
  task_definition = aws_ecs_task_definition.backend.arn
  launch_type     = "FARGATE"
  desired_count   = 1

  # Rolling deploy: start the new task first, then stop the old one (no downtime).
  # On aws_ecs_service these are top-level arguments, not a nested block.
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
}
RAG document ingestion is heavy work (embedding generation), so it was split out into a VPC-attached Lambda. It writes directly to RDS and reaches OpenAI through a NAT gateway.
resource "aws_lambda_function" "insert_rag_document" {
  function_name = "${var.env}-insert-rag-document"
  runtime       = "python3.11"
  timeout       = 300 # long enough for larger PDFs to finish

  vpc_config { # placed inside the VPC so it can reach RDS (pgvector)
    subnet_ids         = [var.lambda_subnet_id]
    security_group_ids = [var.lambda_sg_id]
  }
}
Why does only the operation console go through so many layers, "API Gateway → VPC Link → NLB → ALB → ECS"? The goal is to offload Cognito authentication to API Gateway's authorizer while keeping the internal ECS service off the public internet. A VPC Link for a REST API requires an NLB as its target, so the configuration places an NLB in front of the ALB.
Deployment automated with GitHub Actions
# Backend: build -> push to ECR -> ECS rolling deploy
- run: docker build -t $ECR_REPO:latest -f backend/Dockerfile.prod ./backend
- run: docker push $ECR_REPO:latest
- run: |
    aws ecs update-service --cluster $CLUSTER --service $SERVICE \
      --force-new-deployment
The frontend is deployed with an S3 sync plus a CloudFront invalidation. SPA routing is handled by a CloudFront custom error response that maps 404 to /index.html with a 200 status.
How I guaranteed observability, resilience, and idempotency
Finally, let me lay out the unglamorous but important design points that matter in production operation.
- Observability: record process_time and failure_reason per conversation. ECS / RDS logs go to CloudWatch Logs. "Slow" and "missed" can both be traced per conversation.
- Resilience: don't let generation failures or moderation rejections escape as exceptions; always converge them into a single record with a failure_reason. Always return some audio to the visitor (never freeze silently).
- Idempotency: async results are written to background_task_results keyed by taskId, so repeated polling or network retries don't trigger a second generation.
- Testability: deterministic logic such as product-number extraction and normalization is carved out into pure functions and can be unit-tested without external APIs or a database. Separating the "ambiguous LLM part" from the "strict logic part" pays off in the test strategy as well.
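The idempotency point can be illustrated with a write-once update keyed by the task ID. This sketch uses an in-memory SQLite database as a stand-in for RDS; the table and column names follow the article, while `store_result_once` is a hypothetical helper.

```python
import sqlite3


def store_result_once(conn: sqlite3.Connection, task_id: int, encoded_audio: str) -> bool:
    """Write the audio only if the task has no result yet; return True if written."""
    cur = conn.execute(
        "UPDATE background_task_results SET encoded_audio = ? "
        "WHERE id = ? AND encoded_audio IS NULL",
        (encoded_audio, task_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE background_task_results (id INTEGER PRIMARY KEY, encoded_audio TEXT)"
)
conn.execute("INSERT INTO background_task_results (id, encoded_audio) VALUES (1, NULL)")

first = store_result_once(conn, 1, "QUJD")
second = store_result_once(conn, 1, "REVG")  # a duplicate worker arrives late
stored = conn.execute(
    "SELECT encoded_audio FROM background_task_results WHERE id = 1"
).fetchone()[0]
```

The `encoded_audio IS NULL` condition makes the first writer win, so a retried background task cannot overwrite audio that a client may already have fetched.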
Summary: a demo is talent, production is design
Anyone can now build a demo of voice customer service. What makes the difference is how you get over "the five walls of production" by design.
- Latency: return HTTP immediately (submit-and-poll) and fan out in parallel along the dependency graph.
- Accuracy: combine generation (fluency) with rules (matching against the product-number master).
- Safety: place an input-moderation gate before generation.
- Improvement: keep the search basis and the supervision data, so that operation itself produces training data.
- Operation: use purpose-specific two-layer authentication, plus reproducible infrastructure and zero-downtime deploys with Terraform.
The value of a generative-AI product lies not in "calling a smart model" but in the design that reconciles the model's ambiguity with the strictness of the business. Whether you can do that carefully is, I believe, the watershed that turns a PoC into production.
How I actually built this voice customer-service kiosk, and which technology decisions I made, are described in detail in the case study. If you want to take generative AI into real business use rather than stopping at "verification," please feel free to get in touch.