# MuseTalk Production Deployment in Practice — Docker, GPU Serving, Autoscaling, Cost Optimization, Observability

> Infrastructure design for running MuseTalk self-hosted in production. We explain — in real code — a Docker image pinning CUDA 11.7/PyTorch 2.0.1/mmcv 2.0.1, a GPU inference service that keeps the model resident, queue-driven idempotent async processing, GPU autoscaling and scale-to-zero with KEDA, cost optimization via spot GPUs/fp16/avatar caching, and GPU-metrics observability.

- Published: 2026-06-25
- Author: 友田 陽大
- Tags: MuseTalk, MLOps, Docker, GPU, Autoscaling, Python, Cost optimization, Observability
- URL: https://tomodahinata.com/en/blog/musetalk-self-host-production-deployment-docker-gpu-autoscaling
- Category: Lip-sync & digital humans
- Pillar guide: https://tomodahinata.com/en/blog/ai-lip-sync-talking-head-model-selection-guide-2026

## Key points

- Pinning MuseTalk's dependencies (CUDA 11.7/PyTorch 2.0.1/mmcv 2.0.1/mmdet 3.1.0/mmpose 1.1.0) in Docker is the starting point of stability. It ends non-reproducible environment setups for good
- In production, don't invoke the script directly. Run a GPU inference service that loads the model once at startup, keeps it resident, and LRU-caches avatars. This amortizes the preprocessing and eliminates the per-request cold start
- Prevent double generation with a queue plus idempotency keys, and notify completion via webhook. Treat OOM and preemption as normal cases, and auto-recover by shrinking the batch
- Autoscale with KEDA on queue depth. For low-latency workloads keep a worker warm with `minReplicaCount` ≥ 1; batch workloads scale to zero. Shave the unit price with spot GPUs + fp16 + VAD silence-skipping
- For observability, tie GPU utilization, queue depth, per-job latency, and the sync score together with a correlation ID. Don't log PII, and build a quality gate into CI to stop broken cuts

---

## The goal of this article

Between "I got [MuseTalk](/blog/musetalk-realtime-lip-sync-production-guide) running on my laptop" and "**running it at high volume, low latency, and low cost in production**" lies a wide gap. This article shows the **infrastructure-side implementation** that bridges it: **pinning dependencies with Docker, a model-resident GPU inference service, queue-driven idempotent processing, GPU autoscaling with KEDA, cost optimization with spot GPUs/fp16, and observability of GPU metrics**, all in a ready-to-use form.

This article is for people who must decide "the PoC passed; can we trust it in production next?" and for the engineers who implement that decision. The goal is that by the end you have a **reproducible environment, a service that doesn't crash, and a cost-effective configuration.**

> **About the author (disclosure for credibility):** I **single-handedly designed and implemented** a video-AI localization platform (audio separation → transcription → translation → dubbing → lip-sync) and **operate it as a GPU-using production pipeline.** At the lip-sync stage I self-host multiple models including MuseTalk, and have shaved the unit price with **spot GPUs, batch filling, and idempotent caching.** This article is a **record of the landmines I stepped on in that infrastructure operation.**

---

## 30-second summary (conclusion first)

| Point | Conclusion |
| --- | --- |
| **Environment baseline** | **Pin dependencies with Docker** (CUDA 11.7 / PyTorch 2.0.1 / mmcv 2.0.1 / mmdet 3.1.0 / mmpose 1.1.0) so environment setup never burns you again |
| **Serving** | Stop invoking the script directly; run a GPU inference service with a **resident model + avatar LRU cache** |
| **Async** | **Queue + idempotency keys** to prevent double generation; notify completion via webhook |
| **Recoverability** | Treat OOM and spot interruption as **normal cases**; auto-recover by shrinking the batch |
| **Autoscale** | **KEDA** on queue depth. Keep low-latency workloads warm (`minReplicaCount` ≥ 1); batch scales to zero |
| **Cost** | **Spot GPU + fp16 + VAD silence-skipping + avatar reuse + idempotent cache** |
| **Observability** | Tie GPU utilization, queue depth, per-job latency, and the **sync score** together with a correlation ID |

---

## The overall architecture

Launching `python -m scripts.inference` on every request is **the worst option** for MuseTalk. Each launch repeats **model loading (seconds to tens of seconds)**, and the process exits before the GPU ever warms up. In production, build it like this instead.

```text
[API (Next.js Route Handler)]
   │  ① Zod-validate input → idempotency key → enqueue (returns immediately)
   ▼
[Job queue (SQS / Redis Stream)]
   │
   ▼
[GPU worker (resident, multiple Pods)]   ← KEDA scales the count by queue depth
   │  ② Load the model once at startup and keep it resident. Avatars are LRU-cached
   │  ③ Generate → output to object storage → notify completion via Webhook/update
   ▼
[Object storage (S3/GCS, signed URL)]
```

The key is **"do the heavy initialization once, then reuse it."** This aligns perfectly with the philosophy of MuseTalk's `realtime_inference`, which **pre-bakes the avatar and reuses it.** Design the whole service with that philosophy.

---

## 1. Docker: pin dependencies and never set up the environment again

MuseTalk's biggest difficulty is the **mmlab-family dependency hell** (for details and avoidance, see [Complete Installation Walkthrough](/blog/musetalk-installation-troubleshooting-mmcv-mmdet-mmpose-cuda)). In production, **bake the combination that worked into Docker to guarantee reproducibility.**

```dockerfile
# Dockerfile: pinned environment following the official setup (CUDA 11.7 / Python 3.10 / PyTorch 2.0.1)
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# System dependencies: ffmpeg (video I/O), libgl1 (OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
      python3.10 python3-pip git ffmpeg libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# ① Install the CUDA 11.7 build of PyTorch explicitly (a mismatch falls back to CPU and is extremely slow)
RUN pip3 install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
      --index-url https://download.pytorch.org/whl/cu117

# ② Application dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# ③ Pin the mmlab packages to these versions with mim (order and versions are critical)
RUN pip3 install -U openmim \
    && mim install "mmengine" \
    && mim install "mmcv==2.0.1" \
    && mim install "mmdet==3.1.0" \
    && mim install "mmpose==1.1.0"

COPY . .

# Don't bake weights into the image (avoids bloat and slow pulls); fetch them into a cache at startup
EXPOSE 8000
# The health check confirms the resident model is ready
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
  CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/healthz')" || exit 1

CMD ["python3", "-m", "uvicorn", "service.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

> 🔧 **The "bake weights into the image?" problem**: MuseTalk's weights (VAE, Whisper, dwpose, face-parse, unet, etc.) total **several GB.** Baking them into the image makes **image pulls slow and lengthens autoscale cold starts.** In production, the standard approach is to **keep the image light and fetch the weights from object storage or a shared volume at startup, into a local cache.** Pre-fetching them onto the node image is even faster.
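
The "fetch at startup, into a local cache" step can be sketched as follows. This is a minimal sketch under stated assumptions, not MuseTalk's own loader: `ensure_weights` and the `fetch` callback are hypothetical names, and the actual list of weight files has to come from the official repository.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def ensure_weights(
    cache_dir: Path,
    relative_paths: Iterable[str],
    fetch: Callable[[str, Path], None],
) -> list[str]:
    """Download only the weight files missing from cache_dir; return what was fetched."""
    fetched: list[str] = []
    for rel in relative_paths:
        target = cache_dir / rel
        if target.exists():
            continue  # cache hit: an earlier start on this node already fetched it
        target.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temp file in the same directory, then rename atomically,
        # so an interrupted download never leaves a half-written weight file.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
        os.close(fd)
        try:
            fetch(rel, Path(tmp_name))  # hypothetical callback: copy `rel` to the given path
            os.replace(tmp_name, target)
        finally:
            if os.path.exists(tmp_name):
                os.remove(tmp_name)
        fetched.append(rel)
    return fetched
```

In a real service the `fetch` callback would wrap your storage client (for S3, something along the lines of boto3's `download_file`), and `cache_dir` would point at the volume that survives Pod restarts, such as the `/cache/models` path used later in this article.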

The main entries of `requirements.txt` (take the version pins from the official repository, and don't bump them on your own):

```text
# Example only. Follow the official repository's requirements.txt for the actual packages and pins
diffusers
accelerate
opencv-python
numpy
omegaconf
transformers
# Also add service-side packages such as fastapi / uvicorn / boto3
fastapi
uvicorn[standard]
boto3
```

---

## 2. A GPU inference service that keeps the model resident

Stop invoking the script per request, and **load the model once at startup and keep it resident.** The point is to **move MuseTalk's inference internals (VAE, U-Net, Whisper, face processing) onto the service's lifecycle.**

```python
# service/app.py: GPU inference service with a resident model + avatar LRU cache
from contextlib import asynccontextmanager
from collections import OrderedDict
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, HttpUrl

# Import and thinly wrap the MuseTalk repository's inference internals (match function names to your implementation)
from musetalk_service.engine import MuseTalkEngine  # your own thin adapter layer, not part of MuseTalk


class GenerateRequest(BaseModel):
    avatar_id: str
    source_video_url: HttpUrl
    audio_url: HttpUrl
    bbox_shift: int = 0
    use_float16: bool = True


# LRU cache of avatar preprocessing results (reuse amortizes the preprocessing)
class AvatarCache:
    def __init__(self, capacity: int = 8) -> None:
        self._store: "OrderedDict[str, object]" = OrderedDict()
        self._cap = capacity

    def get(self, key: str):
        if key in self._store:
            self._store.move_to_end(key)
            return self._store[key]
        return None

    def put(self, key: str, value: object) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self._cap:
            self._store.popitem(last=False)  # evict the least recently used avatar


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model onto the GPU once at startup and keep it resident (this is the heavy part, so do it once)
    app.state.engine = MuseTalkEngine(
        device="cuda",
        dtype=torch.float16,  # fp16: less VRAM, faster
        weights_dir="/cache/models",
    )
    app.state.avatars = AvatarCache(capacity=8)
    yield
    # Graceful shutdown: clean up after in-flight jobs
    app.state.engine.close()


app = FastAPI(lifespan=lifespan)


@app.get("/healthz")
def healthz():
    # Report model readiness to the K8s/Docker health check.
    # Return 503 while not ready so status-code-based checks actually fail.
    if not app.state.engine.is_ready():
        raise HTTPException(status_code=503, detail="model_not_ready")
    return {"ok": True}


@app.post("/generate")
def generate(req: GenerateRequest):
    # bbox_shift changes the preprocessing result, so it is part of the cache key
    key = f"{req.avatar_id}:{req.bbox_shift}"
    prepared = app.state.avatars.get(key)
    cache_hit = prepared is not None  # record this before filling the cache
    if not cache_hit:
        # Preprocess a new avatar only once (equivalent to preparation: True)
        prepared = app.state.engine.prepare(str(req.source_video_url), bbox_shift=req.bbox_shift)
        app.state.avatars.put(key, prepared)

    try:
        out_url = app.state.engine.speak(prepared, str(req.audio_url))  # generate immediately
        return {"video_url": out_url, "avatar_cached": cache_hit}
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise HTTPException(status_code=503, detail="gpu_oom_retry")  # the caller shrinks the batch and retries
```

> 💡 `MuseTalkEngine` is **your own thin adapter.** It takes what the official repository's `scripts.realtime_inference` does, namely "load weights → face detection → latent encoding (prepare)" and "audio → generation (speak)", and **splits it into methods of a long-lived process.** Building this adapter is the substance of productionizing, and it is exactly the extra work that separates "running it" from "serving it."
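
The shape of that adapter can be written down as an interface. The sketch below is an assumption about what the service needs, not an API that MuseTalk provides: `Engine` and `SerializedEngine` are hypothetical names. It also adds one thing the service code leaves implicit: FastAPI runs plain `def` endpoints in a thread pool, so two requests can reach the engine at the same time, and a lock is one simple way to keep a single GPU from running two generations at once.

```python
import threading
from typing import Any, Protocol


class Engine(Protocol):
    """Hypothetical interface the service expects from the adapter."""

    def is_ready(self) -> bool: ...
    def prepare(self, source_video_url: str, bbox_shift: int = 0) -> Any: ...
    def speak(self, prepared: Any, audio_url: str) -> str: ...
    def close(self) -> None: ...


class SerializedEngine:
    """Wraps an engine so only one GPU call runs at a time in this process."""

    def __init__(self, inner: Engine) -> None:
        self._inner = inner
        self._gpu_lock = threading.Lock()

    def is_ready(self) -> bool:
        return self._inner.is_ready()

    def prepare(self, source_video_url: str, bbox_shift: int = 0) -> Any:
        with self._gpu_lock:
            return self._inner.prepare(source_video_url, bbox_shift=bbox_shift)

    def speak(self, prepared: Any, audio_url: str) -> str:
        with self._gpu_lock:
            return self._inner.speak(prepared, audio_url)

    def close(self) -> None:
        with self._gpu_lock:  # wait for the in-flight call before releasing resources
            self._inner.close()
```

Because the wrapper only depends on the interface, the same service code can run against a fake engine in CI without a GPU.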

---

## 3. Queue-driven idempotent async processing

Synchronously processing a 100-second-class generation over HTTP invites timeouts and double execution. **Put the job on a queue and process it asynchronously**, and prevent double generation with an **idempotency key.**

```ts
// app/api/lipsync/route.ts: enqueue and return immediately; generation runs on the worker (Next.js Route Handler)
import { NextResponse } from "next/server";
import { createHash } from "node:crypto";
import { z } from "zod";
// Hypothetical module: your own job store and queue client (not shown in this article)
import { jobStore, queue } from "@/lib/jobs";

const Req = z.object({
  avatarId: z.string().min(1),
  sourceVideoUrl: z.string().url(),
  audioUrl: z.string().url(),
  bboxShift: z.number().int().default(0),
});

function jobKey(input: z.infer<typeof Req>): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

export async function POST(req: Request) {
  const parsed = Req.safeParse(await req.json());
  if (!parsed.success) {
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 422 });
  }
  const key = jobKey(parsed.data);

  // Idempotency: never regenerate for identical input (prevents double generation and double billing)
  // Note: get-then-put is not atomic; use a conditional write if concurrent duplicates matter
  const existing = await jobStore.get(key);
  if (existing) return NextResponse.json({ jobId: existing.id, cached: true });

  const job = await queue.enqueue("lipsync", { ...parsed.data, key });
  await jobStore.put(key, { id: job.id, status: "queued" });
  return NextResponse.json({ jobId: job.id, cached: false });
}
```
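
The handler checks the store and then writes to it in two separate steps, which leaves a small window where two identical requests can both enqueue. One way to close that window is to make "check and register" a single atomic claim. The sketch below shows the idea in Python (the worker's language) with an in-memory stand-in; `job_key`, `InMemoryJobStore`, and `claim` are hypothetical names, and in a shared store the equivalent would be a conditional write such as Redis `SET` with the `NX` option.

```python
import hashlib
import json
import threading
from typing import Optional


def job_key(payload: dict) -> str:
    """Stable idempotency key: sorted keys so field order never changes the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryJobStore:
    """Stand-in for a shared store; it illustrates the atomic claim only."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}
        self._lock = threading.Lock()

    def claim(self, key: str, job_id: str) -> Optional[str]:
        """Register job_id under key if unclaimed.

        Returns None when the claim succeeds, or the existing job ID when
        another request got there first.
        """
        with self._lock:
            existing = self._jobs.get(key)
            if existing is not None:
                return existing
            self._jobs[key] = job_id
            return None
```

With this shape the caller enqueues only when `claim` returns `None`, and otherwise returns the existing job ID as a cached response.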

On the worker side, treat **OOM and spot interruption as a normal case, not an exception.**

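```python
# worker.py: absorb failures and recover (at-least-once delivery, idempotent jobs assumed)
import torch

# Hypothetical module of your own helpers (none of these names come from MuseTalk)
from worker_lib import SpotInterruption, mark_done, mark_failed, requeue, run_generate


def handle(job: dict) -> None:
    batch_size = job.get("batch_size", 8)
    try:
        run_generate(job, batch_size=batch_size)
        mark_done(job["key"])
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        if batch_size <= 1:
            mark_failed(job["key"], reason="gpu_oom")  # cannot shrink further, so stop retrying
            return
        requeue(job | {"batch_size": batch_size // 2})  # halve the batch and re-queue
    except SpotInterruption:
        requeue(job)  # spot interruption is "normal": another Pod picks it up (idempotent, so a duplicate run is safe)
```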

With spot GPUs, **interruptions are routine.** Design for "interruption = re-queue, and a duplicate run is safe because jobs are idempotent," and you can **use cheap spot capacity with peace of mind.** This is where the cost savings come from.

---

## 4. GPU autoscaling (KEDA: linked to queue depth)

GPUs are expensive. The iron rule is to **scale out only when the queue has work and scale in when it doesn't.** With Kubernetes + KEDA, tie scaling to **queue depth.**

```yaml
# keda-scaledobject.yaml: autoscale GPU workers on queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: musetalk-worker
spec:
  scaleTargetRef:
    name: musetalk-worker         # the GPU worker Deployment
  minReplicaCount: 1              # low-latency requirement: always keep 1 warm to eliminate cold starts
  maxReplicaCount: 10
  cooldownPeriod: 300            # delay before scaling back to 0 after the last active trigger; N→1 pacing is governed by the HPA
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-1.amazonaws.com/xxx/lipsync
        queueLength: "3"         # target of about 3 queued messages per replica
        awsRegion: ap-northeast-1
```

```yaml
# Deployment excerpt: request a GPU and allow time for graceful shutdown
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # don't drop in-flight jobs
      containers:
        - name: worker
          resources:
            limits:
              nvidia.com/gpu: 1
```

**Choosing a scale strategy per workload:**

- **Low-latency (conversational avatar)** → **always keep a worker warm** with `minReplicaCount: 1` or more, so the user never feels the cold start (image pull + model load).
- **Batch (overnight bulk localization)** → **scale to zero** (`minReplicaCount: 0`). No GPU billing while there is no work, and the cold start is acceptable.

> ⚠️ **Cold-start countermeasures**: With GPU node acquisition (several minutes) + image pull + model load, a start from zero takes **on the order of minutes.** If you use scale-to-zero, shorten the startup by **pre-fetching the container image onto the node image** and keeping the model cache on a **persistent volume.**
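
To reason about what `queueLength`, `minReplicaCount`, and `maxReplicaCount` do together, the following approximation helps. It is a simplification of the average-value rule the Horizontal Pod Autoscaler applies to a queue-length metric; the real controller also accounts for tolerances, stabilization windows, and polling intervals, so treat it as a sizing aid rather than KEDA's exact behavior. `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate replica count for a queue-length target, clamped to the bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# With the manifest values above (target 3, min 1, max 10):
print(desired_replicas(7, 3, 1, 10))    # 3 workers for 7 queued jobs
print(desired_replicas(0, 3, 1, 10))    # 1: the warm replica stays
print(desired_replicas(100, 3, 1, 10))  # 10: capped at maxReplicaCount
```

The same function with `min_replicas=0` shows the batch policy: an empty queue yields zero workers, at the price of the cold start described above.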

---

## 5. Cost optimization: 5 moves to shave the unit price

MuseTalk is fast to begin with, so cost is decided by a design that **eliminates wasted GPU time.**

1. **Spot/preemptible GPUs**: with a design that absorbs interruptions via re-queue, you pay **a fraction of the on-demand price.**
2. **fp16 (`use_float16`)**: directly reduces both generation time and VRAM. Always enable it for drafts.
3. **Skip silence with VAD**: non-speaking intervals don't need mouth movement. Simply **not sending them through generation** cuts GPU time.
4. **Avatar reuse + LRU cache**: amortizes the preprocessing. The second and later runs for the same avatar are fast, and therefore cheap.
5. **Idempotent cache**: eliminates re-generation for identical input. The more **repetitive** the workload (canned FAQ responses, for example), the more it helps.
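
For move 3, the core idea is to find the spans that contain speech and send only those to generation. The sketch below uses a plain frame-energy threshold as a crude stand-in for a real VAD model; `speech_segments`, the 30 ms frame size, and the `0.02` threshold are illustrative assumptions that would need tuning (or replacing with a proper VAD) on real audio.

```python
import math
from typing import List, Sequence, Tuple


def speech_segments(samples: Sequence[float], sample_rate: int,
                    frame_ms: int = 30,
                    rms_threshold: float = 0.02) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose frame RMS reaches the threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Tuple[float, float]] = []
    start = None  # sample index where the current voiced span began
    n_frames = math.ceil(len(samples) / frame_len)
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        voiced = rms >= rms_threshold
        if voiced and start is None:
            start = i * frame_len
        elif not voiced and start is not None:
            segments.append((start / sample_rate, i * frame_len / sample_rate))
            start = None
    if start is not None:  # audio ended while still voiced
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments
```

The worker would then run generation only over the returned spans and reuse the original frames for the silent intervals, so GPU time scales with speech duration rather than clip duration.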

> **A rough break-even guide**: small volume → API (no environment to maintain), steady high volume → self-host. If you keep batches full on spot GPUs, the per-clip unit price can readily fall below the API's. For details on the decision axes, see [the TCO chapter of the selection guide](/blog/ai-lip-sync-talking-head-model-selection-guide-2026#tcoapi-vs-セルフホストの損益分岐).
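
The break-even reasoning can be made concrete with two small formulas. This is a back-of-the-envelope sketch: both function names are hypothetical, and every price you feed in must come from your own quotes and measurements, since this article does not state any.

```python
def cost_per_clip(gpu_price_per_hour: float, gpu_seconds_per_clip: float,
                  utilization: float) -> float:
    """Effective GPU cost of one clip when the GPU is busy only `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_price_per_hour * (gpu_seconds_per_clip / 3600) / utilization


def break_even_clips_per_month(fixed_monthly_cost: float, api_price_per_clip: float,
                               self_host_cost_per_clip: float) -> float:
    """Monthly volume above which self-hosting is cheaper than the API."""
    saving = api_price_per_clip - self_host_cost_per_clip
    if saving <= 0:
        return float("inf")  # self-hosting never catches up at these prices
    return fixed_monthly_cost / saving
```

The `utilization` term is why batch filling matters: halving idle time halves the effective per-clip cost, which moves the break-even volume down.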

---

## 6. Observability and quality gates

Be able to state "it's working" **in numbers.** GPUs in particular can't be operated unless you can see **whether they are kept busy and whether work is backing up.**

Metrics to monitor:

| Metric | Why watch it |
| --- | --- |
| GPU utilization / VRAM | Detect idle (wasted money) / exhaustion (a sign before OOM) |
| Queue depth / wait time | Validity of scaling, a precursor to SLA violation |
| Per-job latency (split prepare/speak) | Where's the slowness? Is avatar reuse taking effect? |
| Failure rate / re-queue rate | Frequency of OOM and spot interruption |
| **Sync score (SyncNet)** | **Machine scoring of quality.** Don't rely on visual inspection |

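```python
# Structured log (PII excluded). A correlation ID ties together API → queue → worker → output
import logging

log = logging.getLogger("musetalk.worker")  # logger name is an example

log.info("lipsync_done", extra={
    "job_id": job_id,
    "avatar_id": avatar_id,
    "avatar_cache_hit": cache_hit,      # did reuse take effect?
    "prepare_ms": prepare_ms,           # preprocessing (heavy only on the first run)
    "speak_ms": speak_ms,               # generation
    "sync_score": sync_score,           # used for the quality-gate threshold check
    "gpu_mem_peak_mb": peak_mem,
    # ❌ Never include: audio content, face images, source video
})
```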
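
One practical detail: with Python's standard `logging`, the keys passed via `extra` become attributes on the log record, but the default formatter does not print them. A minimal sketch of a JSON formatter that emits them, and that enforces the PII rule with an allow-list, could look like this. `JsonAllowListFormatter`, `ALLOWED_FIELDS`, and `build_logger` are hypothetical names.

```python
import json
import logging

# Allow-list of fields that may leave the process. Anything not listed here
# (audio content, face images, source URLs) is dropped by construction.
ALLOWED_FIELDS = (
    "job_id", "avatar_id", "avatar_cache_hit", "prepare_ms",
    "speak_ms", "sync_score", "gpu_mem_peak_mb",
)


class JsonAllowListFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        for field in ALLOWED_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str = "musetalk.worker") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonAllowListFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier setup
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

An allow-list fails safe: a field someone adds to `extra` later stays out of the logs until it is explicitly approved.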

**Quality gate**: MuseTalk's model tree includes `syncnet` (`latentsync_syncnet.pt`). **Score each output's lip-sync quality automatically**, and send **only the clips below the threshold** to human review or re-generation. Build this into CI or post-processing, and **broken cuts don't slip through even in large batches.** Quality assurance by visual inspection alone doesn't hold up at scale.
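
The gate itself is a small partition step once a score exists. The sketch below assumes a higher score means better sync and leaves the scoring call out entirely; `ClipScore` and `quality_gate` are hypothetical names, and the threshold has to be calibrated against clips you have reviewed by hand.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ClipScore:
    job_id: str
    sync_score: float  # assumption in this sketch: higher means better sync


def quality_gate(scores: Iterable[ClipScore], threshold: float
                 ) -> Tuple[List[ClipScore], List[ClipScore]]:
    """Split clips into (passed, needs_review) by the sync-score threshold."""
    passed: List[ClipScore] = []
    needs_review: List[ClipScore] = []
    for clip in scores:
        (passed if clip.sync_score >= threshold else needs_review).append(clip)
    return passed, needs_review
```

In CI the same function can fail the build when `needs_review` is non-empty for a fixed set of reference clips.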

---

## Security

Even for a GPU service, the basics don't change.

- **Don't bake secrets into the image**: API keys and cloud credentials come from environment variables / a secrets manager. Don't include `.env` or key files in the image.
- **Input/output via signed URLs**: exchange input video/audio and output via **time-limited signed URLs.** Don't make the bucket public.
- **Validate at the boundary with Zod/Pydantic**: always validate external input (URLs, parameters). Reject an out-of-range `bbox_shift`, etc.
- **Don't retain PII**: faces and voices are personal data. Design a **retention period** and don't emit content to logs.
- **Least privilege**: narrow the worker's IAM to "read/write this bucket" only.
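
As an illustration of boundary validation, the sketch below rejects non-HTTPS URLs, hosts outside an allow-list, and out-of-range `bbox_shift` values before any GPU work starts. `validate_request`, the allow-list approach, and the default limit of 20 are assumptions for this example; choose the range from what your own avatars actually need.

```python
from urllib.parse import urlparse


class ValidationError(ValueError):
    """Raised when a request field fails boundary validation."""


def validate_request(source_video_url: str, audio_url: str, bbox_shift: int,
                     allowed_hosts: frozenset, max_abs_shift: int = 20) -> None:
    for name, url in (("source_video_url", source_video_url),
                      ("audio_url", audio_url)):
        parsed = urlparse(url)
        if parsed.scheme != "https":
            raise ValidationError(f"{name}: only https URLs are accepted")
        if parsed.hostname not in allowed_hosts:
            raise ValidationError(f"{name}: host is not on the allow-list")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(bbox_shift, bool) or not isinstance(bbox_shift, int):
        raise ValidationError("bbox_shift must be an integer")
    if abs(bbox_shift) > max_abs_shift:
        raise ValidationError("bbox_shift is out of range")
```

Restricting hosts to your own storage endpoints pairs naturally with the signed-URL rule: the worker only ever downloads from buckets it is meant to read.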

---

## Frequently Asked Questions (FAQ)

**Q. Why is direct script invocation no good?**
A. Because **model loading (seconds to tens of seconds)** runs on every request, and the process exits before the GPU warms up. Production assumes a **resident service** that initializes once and reuses avatars through a cache.

**Q. Should weights be baked into the image?**
A. As a rule, **don't.** Several-GB weights bloat the image and slow down image pulls during autoscaling. The standard approach is to **fetch from object storage or a shared volume at startup and cache locally.** If you use scale-to-zero, also pre-fetch onto the node.

**Q. Isn't spot GPU scary?**
A. With a design of "**interruption = re-queue, and safe even if doubled because it's idempotent**," it's not scary. Rather, it's **the move that lowers cost the most.** It's only dangerous to use spot without interruption handling.

**Q. Can low latency and cost be achieved together?**
A. **Splitting by requirement** is the right answer. Keep the conversational avatar warm with `minReplicaCount` ≥ 1 (eliminating the cold start), and scale the overnight batch to zero (zero billing). Operate the same worker under two scale policies.

**Q. Which queue/orchestrator should I use?**
A. This article illustrated with SQS+KEDA, but the philosophy is the same with Redis Stream, Cloud Tasks, or Pub/Sub. The essence is to **scale by queue depth and process idempotently.**

**Q. Is the design the same for on-prem GPUs?**
A. Yes. KEDA works on K8s (on-prem) too. Treat **node preemption/maintenance** as an interruption instead of spot, and the design can be reused as-is.

---

## Summary: bake reproducibility, recoverability, and cost into the design

MuseTalk's production deployment is decided not by the model's quality but by the **craftsmanship of the infrastructure.**

1. **Pin dependencies with Docker** (CUDA 11.7 / PyTorch 2.0.1 / mmcv 2.0.1 …) — reproducibility.
2. **Resident model + avatar cache** — eliminate cold starts and preprocessing.
3. **Queue-driven + idempotent + OOM/interruption recovery** — it doesn't crash.
4. **Queue-linked scaling with KEDA, spot + fp16 + VAD** — cheap.
5. **GPU metrics + a sync-score quality gate** — don't miss breakdowns.

Only after building this far does "**it ran on 1 machine**" become "**it runs on any number, doesn't crash, and runs cheap.**"

> I **run** this article's Docker, GPU serving, spot operation, and quality gate **in a production GPU pipeline.** If you're considering the **production deployment, cost optimization, and operational design** of a MuseTalk/lip-sync platform, please get in touch after looking at the [case study](/case-studies/ai-video-localization-lipsync). As **one person working with generative AI**, I build everything end-to-end from PoC to production operation: fast, cheap, and safe.

---

## Sources & related resources

- **MuseTalk**: [GitHub](https://github.com/TMElyralab/MuseTalk) (dependency versions, `realtime_inference`, model tree, `syncnet`)
- **Environment setup details**: [MuseTalk Installation Complete Walkthrough (mmcv/mmdet/mmpose)](/blog/musetalk-installation-troubleshooting-mmcv-mmdet-mmpose-cuda)
- **Usage & tuning**: [MuseTalk Complete Guide](/blog/musetalk-realtime-lip-sync-production-guide)
- **Model selection & TCO**: [AI Lip-Sync / Talking-Head Model Selection Guide 2026](/blog/ai-lip-sync-talking-head-model-selection-guide-2026)
- **Application (conversational avatar)**: [Building Real-Time AI Avatar Customer Service with MuseTalk](/blog/musetalk-realtime-ai-avatar-llm-tts-digital-human)

Note: Versions, GPU prices, and various limits change over time. **Before going to production, verify with primary sources and actual measurement in your own environment.** Adjust the Docker base image and package versions following the official `requirements.txt`.
