# Building real-time AI-avatar customer service with MuseTalk — production streaming design for ASR→LLM→TTS→lip-sync

> A practical guide to taking a conversational AI avatar (digital human) to production, with MuseTalk as the 'mouth' and a conversation loop of ASR (Whisper) → LLM (Claude) → TTS → lip-sync. Using real code, it covers type-safe orchestration: low latency through avatar pre-generation, streaming hand-off from TTS to lip-sync, interruption (barge-in), the idle loop, the latency budget, idempotency, and observability.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: MuseTalk, digital human, AI avatar, real-time, lip-sync, TypeScript, generative AI, voice AI
- URL: https://tomodahinata.com/en/blog/musetalk-realtime-ai-avatar-llm-tts-digital-human
- Category: Lip-sync & digital humans
- Pillar guide: https://tomodahinata.com/en/blog/ai-lip-sync-talking-head-model-selection-guide-2026

## Key points

- MuseTalk suits the AI avatar's 'mouth' because it generates in a single step with no diffusion iterations, which makes it real-time, and because the avatar can be pre-baked and reused. Amortizing the preprocessing is the key to low latency.
- Perceived responsiveness is decided not by generation fps but by a streaming design that starts speaking on the first TTS chunk. Pipeline ASR→LLM→TTS→lip-sync chunk by chunk instead of waiting for the full text to be synthesized.
- Production requires interruption (barge-in), an idle loop, a latency budget, and fallbacks. Stop instantly when the user speaks, play idle footage when nobody is speaking, and set a timeout on every stage.
- Validate every boundary with Zod and manage each turn with a state machine. An idempotency key and structured logs (PII excluded) prevent duplicate generation and let you trace which turn got stuck.
- The same design extends to reception, customer service, customer support, AI announcers, and education. What wins contracts is not a demo that works but a system that doesn't crash, can be interrupted, and can be stopped.

---

## The goal of this article

"A person on the screen **answers your question on the spot, out loud**": that is what an AI avatar (digital human) for reception, customer service, and customer support does. For the **"mouth"** that makes this work, [MuseTalk](/blog/musetalk-realtime-lip-sync-production-guide) is one of the best choices.

However, MuseTalk on its own only turns audio into mouth movement. **It delivers value in a real project only once you connect it with ASR, an LLM, and TTS into a conversation loop.** This article shows that **overall design** in **type-safe, real code**: low latency via avatar pre-generation, the **streaming hand-off** from TTS to lip-sync, **interruption (barge-in)**, the **idle loop**, and the **latency budget**. By the end, you should be able to avoid the state where "the demo works, but in production it can't be interrupted, doesn't stop, or crashes."

> **About the author (for credibility)**: I **single-handedly designed, implemented, and run in production an AI video localization platform** that fully automates source separation → transcription → translation → dubbing → lip synchronization. Its lip-sync stage evolved from the Wav2Lip family to MuseTalk to [LatentSync](/blog/latentsync-lip-sync-diffusion-model-production-guide), and I use MuseTalk for **conversational use cases where real-time performance and avatar reuse pay off.** The latency design and interruption handling in this article record the points where I got stuck in that production system.

---

## 30-second summary (conclusion first)

| Point | Conclusion |
| --- | --- |
| **Why MuseTalk** | **Real-time** thanks to single-step, non-diffusion generation. **Pre-bake the avatar → reuse** to amortize preprocessing |
| **What really drives perceived responsiveness** | Not generation fps, but a streaming design that **starts speaking on the first TTS chunk** |
| **Overall composition** | User audio → **ASR (Whisper)** → **LLM (Claude)** → **TTS (streaming)** → **MuseTalk** → delivery |
| **Production must-haves** | **Interruption (barge-in), idle loop, latency budget, fallback, idempotency, observability** |
| **Type safety** | LLM/TTS output is **boundary-validated with Zod**, the turn is managed with a **state machine** |
| **Suited uses** | Reception, customer service, customer support, AI announcer, education, events |
| **Honest limits** | 256×256 (use super-resolution for close-ups), overall conversation latency includes ASR/LLM/TTS |

If you want **the overall picture of the mechanism first**, keep reading. If you want to **start from the code**, jump to [streaming orchestration](#core-design-streaming-orchestration-type-safe).

---

## Why MuseTalk suits the "mouth"

A conversational avatar's mouth lives or dies by **① being fast and ② not redoing preprocessing every time.** MuseTalk satisfies both.

- **Single-step generation**: MuseTalk isn't a diffusion model; it **inpaints the lower half of the face in a single pass in latent space.** With no iterative denoising, it runs in real time at **30fps+ at 256×256 (V100).** For details, see [how MuseTalk works](/blog/musetalk-realtime-lip-sync-production-guide#仕組みなぜ拡散しないのに高品質で速いのか論文準拠でやさしく).
- **Avatar reuse**: `realtime_inference` **bakes the heavy preprocessing (face detection, latent encoding) once into an "avatar,"** and from then on **generates just by swapping the audio.** The official guidance is to process a new avatar once with `preparation: True` and reuse it afterwards with `False`.

> ⚠️ **An honest caveat**: "real-time 30fps+" describes **generation throughput.** The **end-to-end conversation latency** also includes ASR→LLM→TTS. That is why this article spends its time not on "choose a fast model" but on **"taming latency through design."**

---

## Overall architecture

The minimal configuration looks like this. The crux is that every stage is connected by **streaming.**

```text
[user speech (mic)]
   │  ① ASR: streaming transcription (Whisper family)
   ▼   → /blog/openai-whisper-production-guide-selfhost-vs-api
[confirmed text / partial text]
   │  ② LLM: generate the reply as a token stream (Claude)
   ▼   → /blog/claude-api-ai-sdk-v6-production-ai-features
[reply text (chunks)]
   │  ③ TTS: streaming speech synthesis per sentence
   ▼
[audio chunks]
   │  ④ MuseTalk: instant lip-sync onto the pre-baked avatar
   ▼
[talking avatar video (frame sequence)]
   │  ⑤ delivery: WebRTC / WebSocket / low-latency HLS
   ▼
[browser / signage]
```

The three principles of the design:

1. **Bake the avatar just once at startup** (`preparation: True`). Don't preprocess per turn.
2. **Stream across stages**: when the LLM's first sentence comes out, send it to TTS; when TTS's first chunk comes out, send it to MuseTalk. **Don't wait for the full text.**
3. **Protect all boundaries with types**: don't trust ASR/LLM/TTS output — **validate with Zod** before flowing it to the next stage.

> The foundation of this voice conversation (ASR, LLM, interruption, observability) is the same discipline as my [voice-AI sales-agent article](/blog/production-voice-ai-sales-agent-bedrock-pgvector). Think of this article as a composition that adds **"a face (lip-sync)"** to that.

---

## Latency budget: decide where you spend time first

What decides the perceived experience is **TTFW (Time To First Word): the time until the first sound is heard and the mouth starts moving.** Setting the budget first keeps later design decisions from wavering. The figures below are guidelines (a starting point for design, to be updated with measurements):

| Stage | Design budget | How to cut it |
| --- | --- | --- |
| ASR (partial confirmation) | ~300ms | streaming ASR, detect the speech endpoint early with VAD |
| LLM (first token) | ~500ms | streaming, short system prompt, nearby region |
| TTS (first chunk) | ~300ms | streaming TTS, per-sentence synthesis |
| MuseTalk (first frame) | ~200ms | **avatar pre-baking**, fp16, small batch |
| delivery (initial buffer) | ~200ms | send small with WebRTC/WebSocket |

Use **total TTFW ≈ 1.5 seconds** (300 + 500 + 300 + 200 + 200 ms) as a first target. What pays off here is the **pre-baking in stage ④**: because the preprocessing is already paid for, MuseTalk's first frame comes out early. Conversely, if even one stage is implemented to "wait for the full text," this budget collapses immediately.
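
To keep the budget honest once measurements arrive, it can help to hold it as data rather than as a table in a document. The following is a minimal sketch under stated assumptions: the names `BUDGET_MS`, `totalBudgetMs`, and `overBudget` are hypothetical, and the numbers are simply the design figures from the table above.

```ts
// Design budget per stage in milliseconds (values taken from the table above).
const BUDGET_MS = {
  asr: 300,
  llm: 500,
  tts: 300,
  lipsync: 200,
  delivery: 200,
} as const;

type Stage = keyof typeof BUDGET_MS;

/** Sum of all stage budgets, i.e. the TTFW target. */
function totalBudgetMs(): number {
  return Object.values(BUDGET_MS).reduce<number>((sum, ms) => sum + ms, 0);
}

/** Stages whose measured first-output latency exceeded the budget, worst overrun first. */
function overBudget(
  measured: Partial<Record<Stage, number>>,
): Array<{ stage: Stage; overMs: number }> {
  return (Object.keys(BUDGET_MS) as Stage[])
    .map((stage) => ({ stage, overMs: (measured[stage] ?? 0) - BUDGET_MS[stage] }))
    .filter((r) => r.overMs > 0)
    .sort((a, b) => b.overMs - a.overMs);
}
```

For example, `overBudget({ asr: 280, llm: 900, tts: 350 })` reports the LLM stage first (400 ms over) and TTS second (50 ms over), which tells you where to spend tuning effort.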

---

## Avatar runtime: bake, then reuse

First, define the minimal contract for the "mouth." **Separating preprocessing (`prepare`) from speaking (`speak`)** is the crux in production (single responsibility principle, SRP).

```ts
// lib/avatar/runtime.ts: abstracts MuseTalk as "bake / speak" (the implementation is swappable = DIP)
import { z } from "zod";

export const SpeakChunk = z.object({
  audioUrl: z.string().url(),
  seq: z.number().int().nonnegative(), // position within the turn; used to guarantee ordering
});
export type SpeakChunk = z.infer<typeof SpeakChunk>;

export interface SpeakResult {
  readonly videoUrl: string; // or a reference to a stream of frames
  readonly seq: number;
}

export interface AvatarRuntime {
  /** Preprocess a new avatar exactly once (equivalent to preparation: True). Heavy, so do it at startup. */
  prepare(input: { avatarId: string; sourceVideoUrl: string; bboxShift?: number }): Promise<void>;
  /** Apply an audio chunk to the pre-baked avatar and sync the mouth (equivalent to preparation: False). */
  speak(input: { avatarId: string; chunk: SpeakChunk }): Promise<SpeakResult>;
  /** Immediately stop any in-progress generation (for barge-in). */
  cancel(avatarId: string): void;
}
```

I leave the implementation (a service wrapper around the Python `realtime_inference`) to the [production-deployment article](/blog/musetalk-self-host-production-deployment-docker-gpu-autoscaling); here I concentrate on the design of **the side that consumes this contract.**
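
The `seq` field in the contract exists so that ordering can be guaranteed. If an implementation ever generates clips concurrently, or the transport delivers them out of order, the delivery layer can restore order from `seq` alone. The sketch below is one possible way to do that; `ClipReorderBuffer` is a hypothetical name, and `Clip` only mirrors the shape of `SpeakResult`.

```ts
/** Minimal clip shape, mirroring SpeakResult from the contract. */
interface Clip {
  readonly videoUrl: string;
  readonly seq: number;
}

/** Releases clips strictly in seq order, buffering any that arrive early. */
class ClipReorderBuffer {
  private readonly pending = new Map<number, Clip>();
  private next = 0;

  /** Accept a clip and return every clip that is now deliverable, in order. */
  push(clip: Clip): Clip[] {
    // Ignore clips that were already delivered or are already buffered.
    if (clip.seq < this.next || this.pending.has(clip.seq)) return [];
    this.pending.set(clip.seq, clip);

    const ready: Clip[] = [];
    let head = this.pending.get(this.next);
    while (head !== undefined) {
      ready.push(head);
      this.pending.delete(this.next);
      this.next += 1;
      head = this.pending.get(this.next);
    }
    return ready;
  }
}
```

Create one buffer per turn, since `seq` restarts at 0 for each turn. A clip that never arrives would stall the buffer, so a real delivery layer would pair this with a per-clip timeout.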

---

## Core design: streaming orchestration (type-safe)

Let's build the processing for one turn. The requirements are **① stream the LLM, ② pass each sentence to TTS → lip-sync as soon as it is complete, and ③ stop instantly on interruption.** A state machine manages the turn's lifecycle.

```ts
// lib/avatar/turn.ts: manages one turn as a state machine and streams across stages
import { z } from "zod";
import type { AvatarRuntime } from "./runtime";

export type TurnState = "idle" | "thinking" | "speaking" | "interrupted" | "error";

// Every dependency is "a function that returns a stream." The types pin the contract (ETC: easy to swap implementations)
export interface TurnDeps {
  llmStream: (userText: string, signal: AbortSignal) => AsyncIterable<string>; // token fragments
  ttsStream: (sentence: string, signal: AbortSignal) => AsyncIterable<unknown>; // audio chunks
  avatar: AvatarRuntime;
  avatarId: string;
  signal: AbortSignal; // owned by the caller (the session); aborting it interrupts this turn
  onState: (s: TurnState) => void; // observability: report state transitions to the outside
  onClip: (r: { videoUrl: string; seq: number }) => void; // hand generated mouth footage to the delivery layer
}

const TtsChunk = z.object({ audioUrl: z.string().url() });

/**
 * Bundle the LLM token stream into sentences (KISS: split on full-width
 * sentence enders and newlines; extend the character class for other languages).
 */
async function* sentences(tokens: AsyncIterable<string>): AsyncIterable<string> {
  let buf = "";
  for await (const t of tokens) {
    buf += t;
    const parts = buf.split(/(?<=[。．！？\n])/); // split after each sentence ender
    buf = parts.pop() ?? "";
    for (const s of parts) if (s.trim()) yield s.trim();
  }
  if (buf.trim()) yield buf.trim();
}

export async function runTurn(userText: string, deps: TurnDeps): Promise<TurnState> {
  const { signal } = deps;
  let seq = 0;
  deps.onState("thinking");

  try {
    // ② stream the LLM → ③ TTS each time a sentence is complete → ④ lip-sync per chunk
    for await (const sentence of sentences(deps.llmStream(userText, signal))) {
      if (signal.aborted) break;
      deps.onState("speaking");
      for await (const raw of deps.ttsStream(sentence, signal)) {
        if (signal.aborted) break;
        const { audioUrl } = TtsChunk.parse(raw); // always validate at the boundary
        const clip = await deps.avatar.speak({
          avatarId: deps.avatarId,
          chunk: { audioUrl, seq: seq++ },
        });
        if (signal.aborted) break; // don't deliver a clip that finished after the barge-in
        deps.onClip(clip); // deliver in order from the head → speaking starts without waiting for the full text
      }
    }
    // A stream may end quietly on abort instead of throwing, so check before reporting idle
    const final: TurnState = signal.aborted ? "interrupted" : "idle";
    deps.onState(final);
    return final;
  } catch (err) {
    if (signal.aborted) {
      deps.onState("interrupted");
      return "interrupted";
    }
    deps.onState("error"); // dropping to a fallback (canned audio / subtitles) is the caller's responsibility
    return "error";
  }
}
```

Two points matter here. **① The per-sentence bundling (`sentences`)** passes the LLM's partial output to TTS immediately, without waiting for the full text. **② A single `AbortSignal` runs through every stage**, and the caller owns the `AbortController` behind it. This is what enables the "interruption" described next.

---

## The three you always need in production: interruption, idle, fallback

This is what separates a demo from production.

### ① Interruption (barge-in)

People don't wait for the avatar to finish speaking. **The moment the user starts talking, the avatar needs to go silent immediately.**

```ts
// lib/avatar/session.ts: a session holds exactly one "current turn"; a new utterance kills the previous turn
import { runTurn, type TurnDeps, type TurnState } from "./turn";

export class AvatarSession {
  private current: AbortController | null = null;
  constructor(private readonly deps: Omit<TurnDeps, "onState" | "signal">) {}

  /** Call the moment the user starts speaking: stops the in-progress turn immediately (barge-in). */
  bargeIn(): void {
    this.current?.abort();
    this.deps.avatar.cancel(this.deps.avatarId); // stop GPU-side generation too, so no cost is wasted
    this.current = null;
  }

  async handle(userText: string): Promise<TurnState> {
    this.bargeIn(); // a new utterance always overrides the previous turn
    const ac = new AbortController();
    this.current = ac;
    try {
      return await runTurn(userText, {
        ...this.deps,
        signal: ac.signal, // the same signal bargeIn() aborts
        onState: () => {}, // this session does not subscribe to state changes (accept onState in deps if you need it)
      });
    } finally {
      if (this.current === ac) this.current = null; // clear only if no newer turn has replaced this one
    }
  }
}
```

It's important that `cancel` **also stops the GPU-side generation.** If you don't stop it, you'll **keep generating footage no one will watch anymore and throw away GPU time** (directly tied to cost).
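
On the runtime side, `cancel` needs some way to reach the work that is currently in flight. A minimal sketch of one approach follows, assuming at most one generation job per avatar at a time. `CancelRegistry` is a hypothetical helper, not part of MuseTalk; how the inference service itself aborts its generation loop depends on your deployment and is not shown here.

```ts
/** Tracks one AbortController per avatar so cancel() can reach in-flight work. */
class CancelRegistry {
  private readonly inflight = new Map<string, AbortController>();

  /** Start a job for an avatar; any previous job for the same avatar is aborted first. */
  begin(avatarId: string): AbortSignal {
    this.inflight.get(avatarId)?.abort();
    const ac = new AbortController();
    this.inflight.set(avatarId, ac);
    return ac.signal;
  }

  /** Abort the in-flight job, if any. Returns whether something was actually cancelled. */
  cancel(avatarId: string): boolean {
    const ac = this.inflight.get(avatarId);
    if (!ac) return false;
    ac.abort();
    this.inflight.delete(avatarId);
    return true;
  }

  /** Call when a job finishes normally so the registry does not leak entries. */
  end(avatarId: string, signal: AbortSignal): void {
    if (this.inflight.get(avatarId)?.signal === signal) this.inflight.delete(avatarId);
  }
}
```

An `AvatarRuntime.speak` implementation could call `begin(avatarId)`, pass the returned signal to its request to the inference service (for example as the `signal` option of `fetch`), and call `end` when the clip is ready. `AvatarRuntime.cancel` then simply delegates to `registry.cancel(avatarId)`.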

### ② Idle loop

When nobody is speaking, a **completely still** avatar makes people wonder whether it froze. **Loop natural waiting footage with the mouth closed (blinking, slight sway)** and cross-fade to the lip-sync footage when speech begins. If you **pre-bake a single idle clip**, the additional generation cost is zero.
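
The cross-fade itself is a small piece of arithmetic. As an illustration only, assuming the client stacks the idle video and the lip-sync video and blends them by opacity, the weights could be computed like this. `crossfade` and the fade duration are hypothetical choices, not something MuseTalk prescribes.

```ts
interface BlendWeights {
  readonly idle: number; // opacity of the idle loop
  readonly speech: number; // opacity of the lip-sync footage
}

/** Linear cross-fade from idle to speech footage over fadeMs, clamped to [0, 1]. */
function crossfade(elapsedMs: number, fadeMs: number): BlendWeights {
  const t = fadeMs <= 0 ? 1 : Math.min(1, Math.max(0, elapsedMs / fadeMs));
  return { idle: 1 - t, speech: t };
}
```

Call it on every animation frame with the time elapsed since the first lip-sync clip started playing. Swapping the arguments' roles (fading `speech` out and `idle` in) handles the return to idle when the turn ends or is interrupted.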

### ③ Fallback

Design it so that even if one stage fails, the conversation doesn't stop.

- **TTS failure** → show subtitle text (no sound comes out, but the meaning is conveyed).
- **MuseTalk failure** → idle footage + audio only (the mouth doesn't move, but the conversation continues).
- **LLM timeout** → play a **canned clip** of "please wait a moment" to buy time.

Set a **timeout** on each stage and, when it is exceeded, switch to the matching fallback above. This gives you the resilience of "**even if one part breaks, the whole keeps talking.**"
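
The timeout-then-fallback rule can be sketched as a small wrapper. This is an illustration under stated assumptions rather than a definitive implementation: `withTimeout`, `runStage`, and `FALLBACK_FOR` are hypothetical names, and the mapping simply restates the three fallbacks listed above.

```ts
type Stage = "tts" | "lipsync" | "llm";
type Fallback = "subtitle" | "audio_only" | "filler";

/** The degradation path for each failed stage, as listed above. */
const FALLBACK_FOR: Record<Stage, Fallback> = {
  tts: "subtitle",
  lipsync: "audio_only",
  llm: "filler",
};

/** Runs work with a deadline; on timeout the signal is aborted and the promise rejects. */
async function withTimeout<T>(
  stage: Stage,
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const ac = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      ac.abort(); // tell the stage to stop so it does not keep burning resources
      reject(new Error(`${stage} exceeded ${ms}ms`));
    }, ms);
  });
  try {
    return await Promise.race([work(ac.signal), deadline]);
  } finally {
    clearTimeout(timer);
  }
}

type StageOutcome<T> = { ok: true; value: T } | { ok: false; fallback: Fallback };

/** Converts a failure or timeout into the fallback the caller should switch to. */
async function runStage<T>(
  stage: Stage,
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<StageOutcome<T>> {
  try {
    return { ok: true, value: await withTimeout(stage, ms, work) };
  } catch {
    return { ok: false, fallback: FALLBACK_FOR[stage] };
  }
}
```

Returning the fallback as data, instead of throwing, keeps the decision of what to play (subtitles, idle footage plus audio, or the canned clip) in one place on the caller's side.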

---

## Idempotency, observability, cost

A conversation looks ephemeral, but it still needs **operational quality.**

- **Idempotency**: for canned responses (like a fixed FAQ answer), **cache the already-generated clip** keyed by `sha256(avatarId + answer text)`. Don't re-create the same answer every time. The more common the question, the more it pays off.
- **Observability**: per turn, put `turnId / state transitions / TTFW / each stage's latency / fallback firings` into structured logs. **Don't emit PII (speech content, faces).** Be in a state where you can later trace "why only this turn was slow."
- **Cost**: ① zero preprocessing via avatar reuse ② stop wasteful generation instantly with barge-in ③ canned-response cache ④ fp16. The principle is **don't use the GPU during time you're not speaking.**
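
For the idempotency bullet, here is a minimal sketch of the cache key and an in-memory cache. The names `clipCacheKey` and `ClipCache` are hypothetical, and a production system would more likely back this with object storage or Redis. One detail worth noting: joining the two fields through `JSON.stringify` keeps the boundary between them unambiguous, whereas plain string concatenation would give `("a1", "2x")` and `("a12", "x")` the same key.

```ts
import { createHash } from "node:crypto";

/** Deterministic key for an (avatar, answer text) pair. */
function clipCacheKey(avatarId: string, answerText: string): string {
  const canonical = JSON.stringify([avatarId, answerText.normalize("NFC").trim()]);
  return createHash("sha256").update(canonical).digest("hex");
}

/** Caches the promise, so concurrent requests for the same answer share one generation. */
class ClipCache {
  private readonly clips = new Map<string, Promise<string>>();

  getOrGenerate(
    avatarId: string,
    answerText: string,
    generate: () => Promise<string>, // returns the clip URL
  ): Promise<string> {
    const key = clipCacheKey(avatarId, answerText);
    const hit = this.clips.get(key);
    if (hit) return hit;

    const created = generate().catch((err) => {
      this.clips.delete(key); // do not cache failures, so the next request retries
      throw err;
    });
    this.clips.set(key, created);
    return created;
  }
}
```

Because the promise is stored before generation finishes, two visitors asking the same FAQ at the same moment trigger only one GPU job.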

```ts
// Structured log (PII excluded). The correlation ID lets you trace across ASR/LLM/TTS/lip-sync
type TurnLog = {
  turnId: string;
  avatarId: string;
  state: "idle" | "thinking" | "speaking" | "interrupted" | "error";
  ttfwMs: number | null; // time until the first sound and mouth movement
  fallback: "none" | "subtitle" | "audio_only" | "filler";
  // ❌ never include: userText, audio bytes, face images
};
```
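
To fill `ttfwMs` you need one clock per turn. The sketch below is one way to do it, assuming the turn starts when the user's utterance ends and that TTFW is taken at the moment the first lip-synced clip is ready (delivery buffering is not included). `TurnTimer` is a hypothetical helper, and the clock is injectable so it can be tested deterministically.

```ts
type Stage = "asr" | "llm" | "tts" | "lipsync";

/** Records when each stage produced its first output, relative to turn start. */
class TurnTimer {
  private readonly now: () => number;
  private readonly startedAt: number;
  private readonly firstOutputMs = new Map<Stage, number>();

  constructor(now: () => number = () => performance.now()) {
    this.now = now;
    this.startedAt = now();
  }

  /** Call on every output of a stage; only the first call per stage is recorded. */
  mark(stage: Stage): void {
    if (!this.firstOutputMs.has(stage)) {
      this.firstOutputMs.set(stage, this.now() - this.startedAt);
    }
  }

  /** Latency of a stage's first output, or null if the stage never produced one. */
  firstOutput(stage: Stage): number | null {
    return this.firstOutputMs.get(stage) ?? null;
  }

  /** TTFW: when the first lip-synced clip was ready; null if the turn never got that far. */
  ttfwMs(): number | null {
    return this.firstOutput("lipsync");
  }
}
```

Create one timer at the start of each turn, call `mark("llm")` on every token, `mark("tts")` on every audio chunk, and `mark("lipsync")` on every clip, then copy `ttfwMs()` into the `TurnLog` when the turn ends.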

---

## In what scenes to use it (applications)

The pre-bake-once, swap-the-audio property maps directly onto these use cases.

- **Unmanned reception / signage customer service**: bake one avatar and answer visitors' questions with LLM + FAQ. **Canned answers are instant via cache.**
- **First-line customer support**: triage inquiries via voice conversation, and escalate only the complex ones to humans.
- **AI announcer / daily broadcast**: bake the anchor footage once, and each day's script is **just an audio swap.**
- **Education / training**: apply per-chapter audio to a lecturer avatar, and **don't re-shoot every time the script is revised.**
- **Multilingual support**: MuseTalk supports many languages including Japanese. If you switch TTS and lip-sync to match the LLM's reply language, **the same avatar does multilingual customer service.**

---

## Pitfalls and avoidance

Listed in the order you tend to hit them in production.

1. **A "wait for the full text" implementation**: running full LLM text → full TTS synthesis → generation makes latency pile up. **Always stream per sentence.**
2. **Forgetting to stop the GPU on interruption**: you keep generating unwatched footage and throw away cost. Stop the GPU side with `cancel`.
3. **The mouth blurs on close-ups**: MuseTalk is 256×256. On signage where the face is large, **use super-resolution (GFPGAN, etc.) together** ([techniques for quality improvement](/blog/musetalk-realtime-lip-sync-production-guide#本番で必ず詰まる落とし穴と回復性設計)).
4. **Source footage that violates the frontal-face assumption**: profile views and occlusions degrade the output. Shoot the avatar footage **frontal, single-face, with stable lighting.**
5. **Consent / impersonation**: avatarizing a real person **requires the person's consent.** Record the use and period so it can be stopped on revocation ([the compliance chapter of the selection guide](/blog/ai-lip-sync-talking-head-model-selection-guide-2026#同意肖像権コンプライアンスエンタープライズが最初に聞くこと)).

---

## Frequently asked questions (FAQ)

**Q. Can it really do "real-time conversation"?**
A. It can. But the responsiveness depends on the **streaming design.** In addition to MuseTalk's 30fps+ generation, with **avatar pre-baking** and a design that **starts speaking on the first TTS chunk,** the aim is to keep TTFW within 1–2 seconds.

**Q. Which TTS / LLM / ASR should I use?**
A. This article doesn't depend on a specific product. For the LLM, [Claude](/blog/claude-api-ai-sdk-v6-production-ai-features); for ASR, the [Whisper family](/blog/openai-whisper-production-guide-selfhost-vs-api) is solid. For TTS, **being streaming-capable** is the top-priority condition. If you firm up the boundaries with Zod, swapping is easy.

**Q. How do I prepare the avatar material?**
A. Shoot ten-odd seconds of video, **frontal, single-face, with stable lighting,** and bake it once with `preparation: True`. After that you can reuse it by swapping the audio. Prepare the idle footage at the same time.

**Q. Does it hold up even on a large signage screen?**
A. At 256×256 as-is, it blurs on a large screen. **One pass of super-resolution on the face region** brings it into a practical range. Because upscaling every frame is expensive, apply it in operation according to the quality tier you need.

**Q. What's the behavior when interrupted?**
A. The instant user speech is detected, `bargeIn()` **stops the previous turn and the GPU generation too,** and goes to idle. Without this, the conversation breaks down with "talking over."

**Q. How do I hold down cost?**
A. ① avatar reuse ② canned-answer cache ③ stop wasteful generation with barge-in ④ fp16. The basic rule is **don't use the GPU while the avatar isn't speaking.**

---

## Summary: MuseTalk as the "mouth," tame latency through design

The success or failure of a real-time AI avatar isn't decided by the model's speed **alone.** A "working demo" becomes "**a customer-service avatar usable in the field**" only when you implement three points with **type-safe orchestration**: **① amortize the preprocessing with MuseTalk's avatar reuse, ② connect all stages by streaming, and ③ keep the conversation going through interruption, idle, and fallback.**

> I implement this article's latency design, interruption, and resilience in **an AI-video platform I actually run in production.** If you're considering building a real-time digital human or AI-avatar customer service, please see my [track record](/case-studies/ai-video-localization-lipsync) and consult me. With **one person × generative AI,** I build end-to-end from PoC to production, fast, cheap, and safe. Next, head to the foundation side that supports this "mouth" — [MuseTalk production-deployment practice](/blog/musetalk-self-host-production-deployment-docker-gpu-autoscaling).

---

## Sources / related resources

- **MuseTalk**: [GitHub](https://github.com/TMElyralab/MuseTalk) / [paper arXiv:2410.10122](https://arxiv.org/abs/2410.10122) (`realtime_inference`, avatar pre-generation, real-time)
- **Model selection**: [AI lip-sync / talking-head model-selection guide 2026](/blog/ai-lip-sync-talking-head-model-selection-guide-2026)
- **Foundation side**: [MuseTalk production-deployment practice](/blog/musetalk-self-host-production-deployment-docker-gpu-autoscaling) / [MuseTalk complete guide (mechanism, tuning)](/blog/musetalk-realtime-lip-sync-production-guide)
- **Conversation stack**: [Claude API implementation](/blog/claude-api-ai-sdk-v6-production-ai-features) / [Whisper production guide](/blog/openai-whisper-production-guide-selfhost-vs-api) / [voice-AI agent design](/blog/production-voice-ai-sales-agent-bedrock-pgvector)

Note: Specs and performance figures change. Before implementing, check each project's official primary sources. The latency budget is a starting point for design; **always measure in your own environment and update it.**
