友田 陽大
Lip-sync & digital humans
MuseTalk
MLOps
Docker
GPU
Autoscaling
Python
Cost optimization
Observability

MuseTalk Production Deployment in Practice — Docker, GPU Serving, Autoscaling, Cost Optimization, Observability

Infrastructure design for running MuseTalk self-hosted in production. We explain — in real code — a Docker image pinning CUDA 11.7/PyTorch 2.0.1/mmcv 2.0.1, a GPU inference service that keeps the model resident, queue-driven idempotent async processing, GPU autoscaling and scale-to-zero with KEDA, cost optimization via spot GPUs/fp16/avatar caching, and GPU-metrics observability.

Reading time
13 min read
Author
友田 陽大

The goal of this article

There is a wide gap between "I got MuseTalk running on my laptop" and "running it at high volume, low latency, and low cost in production." This article shows the infrastructure-side implementation that bridges that gap, in a ready-to-use form: pinning dependencies with Docker, a model-resident GPU inference service, queue-driven idempotent processing, GPU autoscaling with KEDA, cost optimization with spot GPUs and fp16, and observability of GPU metrics.

The intended readers are people who have to decide "the PoC passed; can we now trust it with production operations?" and the engineers who implement it. By the end, you should have a reproducible environment, a service that doesn't crash, and a cost-effective configuration.

About the author (disclosure for credibility): I single-handedly designed and implemented a video-AI localization platform (audio separation → transcription → translation → dubbing → lip-sync) and operate it as a GPU-using production pipeline. At the lip-sync stage I self-host multiple models including MuseTalk, and have shaved the unit price with spot GPUs, batch filling, and idempotent caching. This article is a record of the landmines I stepped on in that infrastructure operation.


30-second summary (conclusion first)

| Point | Conclusion |
|---|---|
| Environment baseline | Pin dependencies with Docker (CUDA 11.7 / PyTorch 2.0.1 / mmcv 2.0.1 / mmdet 3.1.0 / mmpose 1.1.0) so environment setup never has to be repeated |
| Serving | Stop invoking the script directly; run a GPU inference service with a resident model and an avatar LRU cache |
| Async | Queue-driven, with idempotency keys to prevent double generation; notify completion via webhook |
| Recoverability | Treat OOM and spot interruption as normal cases; recover automatically by shrinking the batch |
| Autoscaling | KEDA driven by queue depth. Keep low-latency workloads warm (minReplicas≥1); let batch workloads scale to zero |
| Cost | Spot GPU + fp16 + VAD silence skipping + avatar reuse + idempotent cache |
| Observability | Tie GPU utilization, queue depth, per-job latency, and the sync score together with a correlation ID |

The overall architecture

Launching python -m scripts.inference on every request is the worst way to run MuseTalk. Every launch reloads the model (seconds to tens of seconds), and the process exits before the GPU has warmed up. In production, structure it like this instead.

[API (Next.js Route Handler)]
   │  ① Zod-validate input → idempotency key → enqueue (returns immediately)
   ▼
[Job queue (SQS / Redis Stream)]
   │
   ▼
[GPU worker (resident, multiple Pods)]   ← KEDA scales the count by queue depth
   │  ② Load the model once at startup and keep it resident. Avatars are LRU-cached
   │  ③ Generate → output to object storage → notify completion via Webhook/update
   ▼
[Object storage (S3/GCS, signed URL)]

The key is "do the heavy initialization once, then reuse it." This aligns perfectly with the philosophy of MuseTalk's realtime_inference, which pre-bakes the avatar and reuses it. Design the whole service with that philosophy.


1. Docker: pin dependencies and never set up the environment again

MuseTalk's biggest difficulty is the mmlab-family dependency hell (for details and avoidance, see Complete Installation Walkthrough). In production, bake the combination that worked into Docker to guarantee reproducibility.

# Dockerfile: pinned environment following the official setup (CUDA 11.7 / Python 3.10 / PyTorch 2.0.1)
FROM nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# System dependencies: ffmpeg (video I/O), libgl1 (OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
      python3.10 python3-pip git ffmpeg libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# ① Request the CUDA 11.7 build of PyTorch explicitly (a mismatch falls back to CPU and is extremely slow)
RUN pip3 install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
      --index-url https://download.pytorch.org/whl/cu117

# ② Application dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# ③ Pin the mmlab packages at these versions via mim (order and versions are critical)
RUN pip3 install -U openmim \
    && mim install "mmengine" \
    && mim install "mmcv==2.0.1" \
    && mim install "mmdet==3.1.0" \
    && mim install "mmpose==1.1.0"

COPY . .

# Don't bake weights into the image (avoids bloat and slow pulls); fetch them into a cache at startup
EXPOSE 8000
# Health check confirms that the resident model is ready
HEALTHCHECK --interval=30s --timeout=5s --start-period=120s --retries=3 \
  CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/healthz')" || exit 1

CMD ["python3", "-m", "uvicorn", "service.app:app", "--host", "0.0.0.0", "--port", "8000"]

🔧 The "bake weights into the image?" question: MuseTalk's weights (VAE, Whisper, dwpose, face-parse, unet, etc.) total several GB. Baking them into the image slows pulls and lengthens autoscale cold starts. In production, the standard approach is to keep the image light and fetch the weights from object storage or a shared volume into a local cache at startup. Pre-fetching them onto the node image is faster still.
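
The fetch-at-startup pattern can be sketched as follows. This is a minimal illustration, not MuseTalk's own loader: `ensure_weights` and the `fetch` callable are hypothetical names, and the list of weight keys is whatever your deployment defines.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable

# Hypothetical signature: fetch(remote_key, local_path) downloads one object.
Fetch = Callable[[str, Path], None]


def ensure_weights(cache_dir: Path, keys: Iterable[str], fetch: Fetch) -> list[str]:
    """Download any weight file missing from cache_dir; return the keys fetched."""
    fetched = []
    for key in keys:
        target = cache_dir / key
        if target.exists():
            continue  # cache hit: a warm node skips the download entirely
        target.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temp file in the same directory, then rename atomically,
        # so a Pod killed mid-download never leaves a truncated weight file.
        fd, tmp = tempfile.mkstemp(dir=target.parent, suffix=".part")
        os.close(fd)
        try:
            fetch(key, Path(tmp))
            os.replace(tmp, target)
        finally:
            if os.path.exists(tmp):
                os.remove(tmp)
        fetched.append(key)
    return fetched
```

With boto3, `fetch` could wrap `s3.download_file(bucket, key, str(dest))`; the bucket and key layout are yours to define. Pointing `cache_dir` at a persistent volume means only the first Pod on a node pays for the download.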

The main entries of requirements.txt (follow the official file; don't bump versions on your own):

# Example. Follow the official repository's requirements.txt for the actual package list
diffusers
accelerate
opencv-python
numpy
omegaconf
transformers
# Also add the service-side packages such as fastapi / uvicorn / boto3
fastapi
uvicorn[standard]
boto3

2. A GPU inference service that keeps the model resident

Stop invoking the script; instead, load the model once at startup and keep it resident. The point is to move MuseTalk's inference internals (VAE, U-Net, Whisper, face processing) onto the service's lifecycle.

# service/app.py: GPU inference service with a resident model and an avatar LRU cache
from contextlib import asynccontextmanager
from collections import OrderedDict
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, HttpUrl

# Import the MuseTalk repository's inference internals behind a thin wrapper (match function names to your implementation)
from musetalk_service.engine import MuseTalkEngine  # ← your own thin adapter layer


class GenerateRequest(BaseModel):
    avatar_id: str
    source_video_url: HttpUrl
    audio_url: HttpUrl
    bbox_shift: int = 0
    use_float16: bool = True


# LRU cache of avatar preprocessing results (reuse amortizes the preprocessing cost)
class AvatarCache:
    def __init__(self, capacity: int = 8) -> None:
        self._store: "OrderedDict[str, object]" = OrderedDict()
        self._cap = capacity

    def get(self, key: str):
        if key in self._store:
            self._store.move_to_end(key)
            return self._store[key]
        return None

    def put(self, key: str, value: object) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self._cap:
            self._store.popitem(last=False)  # evict the least recently used avatar


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model onto the GPU once at startup and keep it resident (this is the heavy part, so do it only once)
    app.state.engine = MuseTalkEngine(
        device="cuda",
        dtype=torch.float16,  # fp16 saves VRAM and runs faster
        weights_dir="/cache/models",
    )
    app.state.avatars = AvatarCache(capacity=8)
    yield
    # Graceful shutdown: clean up in-flight jobs
    app.state.engine.close()


app = FastAPI(lifespan=lifespan)


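@app.get("/healthz")
def healthz():
    # Report model readiness to the K8s/Docker health check.
    # Return 503 when not ready: the Docker HEALTHCHECK only looks at the status code.
    if not app.state.engine.is_ready():
        raise HTTPException(status_code=503, detail="model_not_ready")
    return {"ok": True}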


@app.post("/generate")
def generate(req: GenerateRequest):
    # bbox_shift changes the preprocessing result, so it is part of the cache key
    key = f"{req.avatar_id}:{req.bbox_shift}"
    prepared = app.state.avatars.get(key)
    cache_hit = prepared is not None  # record this before prepare() assigns `prepared`

    try:
        if prepared is None:
            # Preprocess a new avatar only once (equivalent to preparation: True)
            prepared = app.state.engine.prepare(str(req.source_video_url), bbox_shift=req.bbox_shift)
            app.state.avatars.put(key, prepared)
        out_url = app.state.engine.speak(prepared, str(req.audio_url))  # generate immediately
        return {"video_url": out_url, "avatar_cached": cache_hit}
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise HTTPException(status_code=503, detail="gpu_oom_retry")  # the caller shrinks the batch and retries

💡 MuseTalkEngine is your own thin adapter. It splits what the official repository's scripts.realtime_inference does into methods of a long-lived process: "load weights → face detection → latent encoding" (prepare) and "audio → generation" (speak). Building this adapter is the substance of productionizing, and it is exactly the extra work that separates "running it" from "serving it."
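
To make the shape of such an adapter concrete, here is a minimal skeleton under stated assumptions. `EngineSkeleton`, `PreparedAvatar`, and the four injected callables are hypothetical names; the callables stand in for code you would lift from the MuseTalk repository, and their signatures are not MuseTalk's API. Unlike the constructor used in service/app.py, this version takes its dependencies as arguments so the lifecycle can be exercised without a GPU.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PreparedAvatar:
    """Everything that depends only on the source video (computed once per avatar)."""
    frames: Any   # decoded frames
    coords: Any   # per-frame face boxes
    latents: Any  # latents of the face crops


class EngineSkeleton:
    """Long-lived wrapper: heavy loading in __init__, cheap calls afterwards."""

    def __init__(self, load_models: Callable[[], Any],
                 detect_faces: Callable[[str, int], tuple],
                 encode_latents: Callable[[Any, Any, Any], Any],
                 generate: Callable[[Any, PreparedAvatar, str], str]) -> None:
        self._detect, self._encode, self._generate = detect_faces, encode_latents, generate
        self._models = load_models()  # heavy: runs once per process

    def is_ready(self) -> bool:
        return self._models is not None

    def prepare(self, video_url: str, bbox_shift: int = 0) -> PreparedAvatar:
        frames, coords = self._detect(video_url, bbox_shift)
        return PreparedAvatar(frames, coords, self._encode(self._models, frames, coords))

    def speak(self, avatar: PreparedAvatar, audio_url: str) -> str:
        return self._generate(self._models, avatar, audio_url)  # returns the output URL

    def close(self) -> None:
        self._models = None  # drop references so GPU memory can be released
```

The design choice worth noting is that `prepare` returns a plain value object: that is what makes it cacheable in the LRU cache and reusable across any number of `speak` calls.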


3. Queue-driven idempotent async processing

Processing a generation that takes on the order of 100 seconds synchronously over HTTP invites timeouts and double execution. Put it on a queue to make it asynchronous, and prevent double generation with an idempotency key.

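// app/api/lipsync/route.ts: enqueue and return immediately; generation runs on the worker (Next.js Route Handler)
import { NextResponse } from "next/server";
import { createHash } from "node:crypto";
import { z } from "zod";
// Assumption: jobStore and queue are your own thin wrappers (a key-value store and the job queue); the path is illustrative.
import { jobStore, queue } from "@/lib/jobs";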

const Req = z.object({
  avatarId: z.string().min(1),
  sourceVideoUrl: z.string().url(),
  audioUrl: z.string().url(),
  bboxShift: z.number().int().default(0),
});

function jobKey(input: z.infer<typeof Req>): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

export async function POST(req: Request) {
  const parsed = Req.safeParse(await req.json());
  if (!parsed.success) {
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 422 });
  }
  const key = jobKey(parsed.data);

  // Idempotency: don't regenerate for the same input (prevents double generation and double billing)
  const existing = await jobStore.get(key);
  if (existing) return NextResponse.json({ jobId: existing.id, cached: true });

  const job = await queue.enqueue("lipsync", { ...parsed.data, key });
  await jobStore.put(key, { id: job.id, status: "queued" });
  return NextResponse.json({ jobId: job.id, cached: false });
}
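
Two details of the handler deserve a closer look: the key must not depend on field order, and the "check, then store" step must be atomic or two racing requests will both enqueue. The sketch below shows both ideas in Python, the language of the worker side. `job_key`, `InMemoryJobStore`, and `put_if_absent` are hypothetical names; a real store would use its own conditional-write primitive.

```python
import hashlib
import json
import threading


def job_key(payload: dict) -> str:
    """Stable SHA-256 over the request: sorted keys make field order irrelevant."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryJobStore:
    """Stand-in for a real store. put_if_absent must be atomic there as well
    (for example a conditional write), or two racing requests both enqueue."""

    def __init__(self) -> None:
        self._jobs: dict[str, dict] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, record: dict) -> tuple[dict, bool]:
        """Return (stored_record, created). Only the first caller sees created=True."""
        with self._lock:
            if key in self._jobs:
                return self._jobs[key], False
            self._jobs[key] = record
            return record, True
```

Only the caller that receives `created=True` enqueues the job; every other caller returns the stored job ID as a cache hit.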

On the worker side, treat OOM and spot interruption as a normal case, not an exception.

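# worker.py: absorb failures and recover (at-least-once delivery, idempotent processing assumed)
import torch

# run_generate, mark_done, and requeue are your own helpers (not shown here).


class SpotInterruption(Exception):
    """Raised by your shutdown hook when the spot node is being reclaimed."""


def handle(job: dict) -> None:
    try:
        run_generate(job, batch_size=job.get("batch_size", 8))
        mark_done(job["key"])
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        bs = max(1, job.get("batch_size", 8) // 2)
        requeue(job | {"batch_size": bs})  # halve the batch and re-queue (dict | dict needs Python 3.9+)
    except SpotInterruption:
        requeue(job)  # spot interruption is "normal": another Pod picks it up (idempotent, so a double run is safe)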

If you assume spot GPUs, interruptions are routine. Design for "interruption = re-queue, and a double run is safe because processing is idempotent," and you can use cheap spot capacity with peace of mind. This is where the cost savings come from.


4. GPU autoscaling (KEDA: linked to queue depth)

GPUs are expensive. The iron rule is to scale out only when work is queued and scale in when it isn't. With Kubernetes + KEDA, tie scaling to queue depth.

# keda-scaledobject.yaml: autoscale GPU workers by queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: musetalk-worker
spec:
  scaleTargetRef:
    name: musetalk-worker         # Deployment of the GPU worker
  minReplicaCount: 1              # low-latency requirement: keep one replica warm to eliminate cold starts
  maxReplicaCount: 10
  cooldownPeriod: 300            # scale in gently (avoid cutting off in-progress generation); KEDA applies this when scaling to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-1.amazonaws.com/xxx/lipsync
        queueLength: "3"         # target of 3 queued messages per replica; scale out beyond that
        awsRegion: ap-northeast-1

# Deployment excerpt: request a GPU and allow time for graceful shutdown
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120   # don't drop in-progress jobs
      containers:
        - name: worker
          resources:
            limits:
              nvidia.com/gpu: 1

Choose the scaling strategy by workload:

  • Low-latency (conversational avatar): always keep at least one replica warm with minReplicaCount: 1 or more. Don't let the user feel the cold start (image pull + model load).
  • Batch (overnight bulk localization): scale to zero (minReplicaCount: 0). No GPU billing while there's no work, and the cold start is acceptable.
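
The arithmetic behind queue-depth scaling can be approximated as below. This is a simplification for reasoning about the settings, not KEDA's implementation: the real autoscaler also applies tolerances and stabilization windows, and KEDA handles the transition between zero and one replica itself. `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(queue_depth: int, per_replica_target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the steady-state replica count for a queue-length trigger."""
    if per_replica_target <= 0:
        raise ValueError("per_replica_target must be positive")
    wanted = math.ceil(queue_depth / per_replica_target)
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 3 messages per replica, a minimum of 1, and a maximum of 10, a backlog of 7 messages gives `desired_replicas(7, 3, 1, 10) == 3`, and an empty queue stays at 1 replica (or 0 for the batch policy).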

⚠️ Cold-start countermeasures: with GPU node acquisition (several minutes) + image pull + model load, a start from zero takes on the order of minutes. If you use scale-to-zero, shorten startup by pre-fetching the image onto the node image and keeping the model cache on a persistent volume.


5. Cost optimization: 5 moves to shave the unit price

MuseTalk is fast to begin with, so cost is decided by a design that eliminates wasted GPU time.

  1. Spot/preemptible GPU: with a design that absorbs interruptions via re-queue, you pay a fraction of the on-demand price.
  2. fp16 (use_float16): directly affects speed and VRAM. Always enable it for drafts.
  3. Skip silence with VAD: the mouth doesn't need to move in non-speaking intervals. Simply not sending them through generation reduces GPU time.
  4. Avatar reuse + LRU cache: amortize the preprocessing. The second and later runs of the same avatar are faster and therefore cheaper.
  5. Idempotent cache: eliminate re-generation of the same input. The more repetitive the workload (such as canned FAQ responses), the more it helps.
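
As an illustration of move 3, the sketch below finds voiced spans with a plain RMS-energy threshold. It is a stand-in for a real VAD (a production pipeline would more likely use a trained detector), and `speech_segments`, the threshold, and the hangover length are hypothetical values to be tuned on your own audio.

```python
import math
from typing import Sequence


def speech_segments(samples: Sequence[float], sample_rate: int,
                    frame_ms: int = 20, threshold: float = 0.02,
                    hangover_frames: int = 5) -> list[tuple[float, float]]:
    """Return (start_s, end_s) spans whose RMS energy reaches the threshold.

    hangover_frames keeps a span open across short pauses so the mouth does not
    snap shut between words.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: list[tuple[float, float]] = []
    start = None     # index of the first frame of the open span
    last_voiced = 0  # index of the most recent voiced frame
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced > hangover_frames:
            segments.append((start * frame_len / sample_rate,
                             (last_voiced + 1) * frame_len / sample_rate))
            start = None
    if start is not None:  # close a span that runs to the end of the audio
        segments.append((start * frame_len / sample_rate,
                         (last_voiced + 1) * frame_len / sample_rate))
    return segments
```

Only the returned spans need to go through generation; for the rest, the original frames can be passed through unchanged.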

A rough break-even guide: small volume → API (no environment to maintain); steady high volume → self-host. If you keep batches full on spot GPUs, the per-clip unit price can fall below the API's. For the decision axes in detail, see the TCO chapter of the selection guide.
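
The break-even comparison reduces to a few lines of arithmetic. The sketch below is a simplified model under stated assumptions: every input (GPU price, seconds per clip, utilization, fixed monthly cost, API price) is a figure you measure or look up yourself, and both function names are hypothetical.

```python
def self_host_cost_per_clip(gpu_price_per_hour: float, gpu_seconds_per_clip: float,
                            utilization: float) -> float:
    """Effective cost of one clip: idle GPU time is paid for but produces nothing."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_price_per_hour * (gpu_seconds_per_clip / 3600) / utilization


def break_even_clips_per_month(fixed_monthly_cost: float, api_price_per_clip: float,
                               self_host_per_clip: float) -> float:
    """Monthly volume above which self-hosting is cheaper than the API."""
    saving = api_price_per_clip - self_host_per_clip
    if saving <= 0:
        return float("inf")  # the API is never more expensive per clip
    return fixed_monthly_cost / saving
```

The `utilization` term is why batch filling matters: halving idle time halves the effective per-clip cost without touching the GPU price.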


6. Observability and quality gates

Get to a state where you can say "it's working" with numbers. GPUs in particular can't be operated unless you can see whether they are warm and whether they are clogged.

Metrics to monitor:

| Metric | Why watch it |
|---|---|
| GPU utilization / VRAM | Detect idle time (wasted money) and exhaustion (an early sign of OOM) |
| Queue depth / wait time | Whether scaling is adequate; a precursor to SLA violations |
| Per-job latency (split into prepare/speak) | Where is the time going? Is avatar reuse taking effect? |
| Failure rate / re-queue rate | Frequency of OOM and spot interruption |
| Sync score (SyncNet) | Machine scoring of quality, so you don't rely on visual inspection |
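
# Structured log (PII excluded). A correlation ID links API → queue → worker → output
log.info("lipsync_done", extra={
    "job_id": job_id,
    "avatar_id": avatar_id,
    "avatar_cache_hit": cache_hit,      # did reuse take effect?
    "prepare_ms": prepare_ms,           # preprocessing (heavy only the first time)
    "speak_ms": speak_ms,               # generation
    "sync_score": sync_score,           # used for the quality-gate threshold check
    "gpu_mem_peak_mb": peak_mem,
    # ❌ Never include: audio content, face images, source video
})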
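
For the `extra` fields to be searchable, the log handler has to emit them. A minimal sketch using only the standard library follows; `JsonFormatter` and `make_logger` are hypothetical names, and a real deployment may prefer an existing structured-logging library.

```python
import json
import logging

# Attributes every LogRecord has; anything else arrived through `extra=`.
_STANDARD = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render one JSON object per line so fields are queryable downstream."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update({k: v for k, v in vars(record).items() if k not in _STANDARD})
        return json.dumps(payload, default=str)


def make_logger(name: str = "lipsync") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any existing handlers to avoid duplicate lines
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

If the API, the queue message, and the worker all log the same `job_id`, filtering on that one field reconstructs the whole path of a request.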

Quality gate: MuseTalk's model tree includes SyncNet (latentsync_syncnet.pt). Score each output's audio-visual sync automatically, and send only those below the threshold to human review or re-generation. Build this into CI or post-processing, and broken cuts won't slip through even in large batches. Quality assurance by visual inspection alone doesn't scale.
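
The gate itself is a simple partition once each output has a score. The sketch below assumes a higher score means better sync and that the threshold has been calibrated on your own data; `Scored` and `quality_gate` are hypothetical names, and computing the score with SyncNet is outside this snippet.

```python
from typing import Iterable, NamedTuple


class Scored(NamedTuple):
    job_id: str
    sync_score: float


def quality_gate(results: Iterable[Scored], threshold: float) -> tuple[list[str], list[str]]:
    """Split job IDs into (passed, needs_review) by sync score."""
    passed: list[str] = []
    review: list[str] = []
    for r in results:
        (passed if r.sync_score >= threshold else review).append(r.job_id)
    return passed, review
```

Tracking the size of the review list over time also gives an early warning when a model or preprocessing change degrades quality.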


Security

Even for a GPU service, the basics don't change.

  • Don't bake secrets into the image: API keys and cloud credentials come from environment variables / a secrets manager. Don't include .env or key files in the image.
  • Input/output via signed URLs: exchange input video/audio and output via time-limited signed URLs. Don't make the bucket public.
  • Validate at the boundary with Zod/Pydantic: always validate external input (URLs, parameters). Reject an out-of-range bbox_shift, etc.
  • Don't retain PII: faces and voices are personal data. Design a retention period and don't emit content to logs.
  • Least privilege: narrow the worker's IAM to "read/write this bucket" only.
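
Boundary validation can be as small as the sketch below. The allowed `bbox_shift` range and the host allowlist are placeholders chosen for illustration, not values from MuseTalk; restricting hosts is one way to keep the worker from fetching arbitrary URLs.

```python
from urllib.parse import urlparse

# Assumptions for illustration: the range and the host are placeholders.
BBOX_SHIFT_RANGE = range(-20, 21)
ALLOWED_HOSTS = {"media.example.com"}


def validate_request(source_video_url: str, audio_url: str, bbox_shift: int) -> None:
    """Raise ValueError for anything the worker should never be asked to fetch or run."""
    for url in (source_video_url, audio_url):
        parts = urlparse(url)
        if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
            raise ValueError(f"URL not allowed: {url}")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(bbox_shift, bool) or not isinstance(bbox_shift, int):
        raise ValueError("bbox_shift must be an integer")
    if bbox_shift not in BBOX_SHIFT_RANGE:
        raise ValueError("bbox_shift out of range")
```

The same checks can be expressed as Zod refinements on the API side or Pydantic validators on the service side; the point is that both boundaries reject the request before any GPU time is spent.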

Frequently Asked Questions (FAQ)

Q. Why is direct script invocation no good? A. Because model loading (seconds to tens of seconds) runs on every request, and the process exits before the GPU warms up. Running it as a resident service, initializing once, and reusing avatars through caching is the baseline for production.

Q. Should weights be baked into the image? A. As a rule, no. Several GB of weights bloat the image and slow the pull during autoscaling. The standard approach is to fetch them from object storage or a shared volume at startup and cache them locally. If you use scale-to-zero, also pre-fetch them onto the node.

Q. Isn't spot GPU scary? A. With a design of "interruption = re-queue, and safe even if doubled because it's idempotent," it's not scary. Rather, it's the move that lowers cost the most. It's only dangerous to use spot without interruption handling.

Q. Can low latency and cost be achieved together? A. Splitting by requirement is the right answer. Keep the conversational avatar warm with minReplicas≥1 (eliminating the cold start), and scale the overnight batch to zero (zero billing). Operate the same worker with two scale policies.

Q. Which queue/orchestrator should I use? A. This article illustrated with SQS+KEDA, but the philosophy is the same with Redis Stream, Cloud Tasks, or Pub/Sub. The essence is to scale by queue depth and process idempotently.

Q. Is the design the same for on-prem GPUs? A. Yes. KEDA works on K8s (on-prem) too. Treat node preemption/maintenance as an interruption instead of spot, and the design can be reused as-is.


Summary: bake reproducibility, recoverability, and cost into the design

MuseTalk's production deployment is decided not by the model's quality but by the craftsmanship of the infrastructure.

  1. Pin dependencies with Docker (CUDA 11.7 / PyTorch 2.0.1 / mmcv 2.0.1 …) — reproducibility.
  2. Resident model + avatar cache — eliminate cold starts and preprocessing.
  3. Queue-driven + idempotent + OOM/interruption recovery — it doesn't crash.
  4. Queue-linked scaling with KEDA, spot + fp16 + VAD — cheap.
  5. GPU metrics + a sync-score quality gate — don't miss breakdowns.

Only after building this far does "it ran on 1 machine" become "it runs on any number, doesn't crash, and runs cheap."

I actually implement this article's Docker, GPU serving, spot operation, and quality gate in a production GPU pipeline. If you're considering the production deployment, cost optimization, and operational design of a MuseTalk/lip-sync platform, please consult us after looking at the case study. With one person × generative AI, I build everything end-to-end from PoC to production operation — fast, cheap, and safe.


※ Versions, GPU prices, and various limits get updated. Before going to production, verify with primary sources and actual measurement in your own environment. Adjust the Docker base image and package versions following the official requirements.txt.


友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
