友田 陽大

LatentSync Complete Guide: Running ByteDance's Diffusion Lip-Sync Model in Production, Faithful to the Official Docs

A guide to LatentSync, ByteDance's audio-conditioned latent-diffusion lip-sync model, grounded in the official documentation (GitHub, paper, HuggingFace). It covers how the latest release, 1.6, works; both the Replicate API and self-hosting procedures; tuning of inference_steps / guidance_scale; and resilience design against face-detection failure, OOM, and audio drift, with concrete code for what production needs.

Reading time
24 min read
Author
友田 陽大

The Goal of This Article

LatentSync is a lip-sync diffusion model, published by ByteDance, that generates mouth movement from audio. The GitHub repository's tagline states its philosophy directly: "Taming Stable Diffusion for Lip Sync!"

This article stays strictly grounded in the official documentation (GitHub / paper / HuggingFace) and uses working code to fill in what the official README leaves out: in which scenarios to use it, how to use it, and where it gets stuck. By the end, you should be able to do the following 3 things.

  1. Explain to others what kind of model LatentSync is, and why its quality is high.
  2. Decide between the Replicate API (no GPU of your own needed) and self-hosting, and get started today.
  3. Assemble a resilient implementation that holds up in production rather than just in a demo: face-detection failure, long-video OOM, audio drift.

About the author (disclosure for credibility): I single-handedly designed, implemented, and run in production an AI video-localization platform that, from a single video upload, fully automates "audio separation → transcription → translation → multilingual dubbing → lip-sync." The model behind its 5th stage (lip-sync) went from the Wav2Lip family to MuseTalk and then to LatentSync. The "pitfalls" and "resilience design" in this article are not demo knowledge but a record of the problems I hit in that real operation. The design of the whole pipeline is covered in a separate article, and the project overview is at the portfolio link at the end of this article.


A 30-Second Summary (Conclusion First)

| Viewpoint | Conclusion |
|---|---|
| What kind of model | A diffusion model that, from 1 talking video + arbitrary audio, generates "a video with the mouth re-made as if speaking that audio." |
| What's impressive | With SyncNet supervision + TREPA, it achieves both "audio and mouth match yet natural." The paper reports it surpasses the then-SOTA on HDTF/VoxCeleb2 |
| Latest version | LatentSync 1.6 (published 2025/06/11). 512×512 training improves teeth/mouth blur. Inference VRAM about 18GB |
| Lightweight version | LatentSync 1.5. Runs on inference VRAM about 8GB. This one if cost is top priority |
| Just to try | Replicate API: no own GPU, about $0.11 per run, L40S, about 108 seconds. Fastest for a prototype |
| To build out | Self-hosting: conda + PyTorch 2.5.1. Apache-2.0, so commercial OK. This one for large batch processing |
| The 2 big quality knobs | inference_steps (20→50 for higher quality, slower) / guidance_scale (1.0→3.0 for sync-focus, distortion risk) |
| Suited uses | Video-localization dubbing, AI anchor/avatar, e-learning, personalized video |
| Unsuited uses | Profile views / heavy motion / material with multiple speakers on screen at once, zero-latency live streaming |

If you want to "first check the quality on your own material," jump straight to the Replicate API chapter (Usage A). You'll get a result in 3 minutes.


What Kind of Model LatentSync Is

The inputs are 2 — a video of a talking person and the audio you want them to speak. The output is 1 — a video with only the mouth re-made to sync to that audio. The face direction, expression, background, and camerawork stay as the source video; only the mouth shape is generated to match the new audio.

This works in scenes like the following, for example.

  • Video localization (dubbing): replace the narration of an explainer video shot in Japanese with English, and the mouth looks like it's speaking English. It is more immersive than subtitles and raises conversion (CV) in overseas expansion.
  • AI anchor / virtual human: apply each day's new script audio to anchor footage shot once, and mass-produce daily news with zero re-shoots.
  • E-learning / in-house training: shoot the instructor once, swap in per-chapter audio, and update materials quickly. No re-shoot each time the script is revised.
  • Personalized marketing video: feed audio containing the recipient's name and the product name into one template video, so that "Hello, Mr. ○○" becomes a different video for each person.

Conversely, quality drops on material where a profile view lasts a long time, the mouth is hidden by a hand or mic, multiple faces appear in one frame, or cuts switch rapidly, because these easily break the internal face detection / face alignment. The pitfalls chapter later shows concrete workarounds.


The Mechanism: Why "Diffusion × SyncNet × TREPA" (Made Gentle, Faithful to the Paper)

This chapter breaks down the core of the official paper while keeping it accurate. If you only want the implementation, you may jump to the next chapter. That said, understanding this part explains why the pitfalls take the shape they do.

The Starting Point: Use a Diffusion Model As-Is and "the Mouth Doesn't Match"

LatentSync's base is a Stable Diffusion-family Latent Diffusion Model (LDM). Rather than manipulating the image directly, it does noise removal in a latent space compressed by a VAE, inheriting SD's advantage of being computationally light even at high resolution.

The official structure is this (from the README).

  • Audio is converted from a mel-spectrogram to an audio embedding by Whisper.
  • That audio embedding is injected into the generation process through the U-Net's cross-attention layers.
  • A reference frame and a masked frame are channel-wise concatenated with the noised latent and input to the U-Net.

But what the paper (arXiv:2412.09262) points out is the problem that "naively applying an audio-conditioned LDM gives insufficient lip-sync accuracy." The cause is shortcut learning — the model takes the easy path and learns mainly the visual correlation between adjacent frames, not the relationship between audio and mouth. The result is a video that's "clean as footage, but the audio and mouth are off."

Solution 1: Force "Audio-Mouth Match" with SyncNet Supervision

So LatentSync incorporates SyncNet (a network that judges the sync degree of audio and mouth) as a learning supervision signal. SyncNet scores whether the generated mouth matches the audio, and back-propagating that loss explicitly forces the model to "match audio and mouth." Further, to stabilize convergence, it introduces an architecture called StableSyncNet.

As a quantitative result, the paper reports the SyncNet accuracy on the HDTF test set improved 91% → 94%. This is a direct metric of "how accurately the mouth follows the audio."

Solution 2: With TREPA, Make It "Not Jitter in Time"

Even if you make each frame cleanly, playing it continuously flickers / the teeth waver — this is a typical weakness of diffusion-based lip-sync. With TREPA (Temporal REPresentation Alignment), LatentSync aligns the time-direction representation of the generated frame sequence, raising temporal consistency. At training it applies TREPA, LPIPS, and SyncNet losses in pixel space.

These 3 Points Decide "the Shape of the Pitfalls"

Pin down the mechanism and the troubles described later turn out to be inevitable.

  • Because face alignment is the premise, profile views, occlusion, and multiple faces are weak.
  • The diffusion iteration count (inference_steps) directly trades off quality and speed.
  • Turn up the audio-condition strength (guidance_scale) too far and it tilts toward SyncNet, producing distortion and jitter.

In other words, the parameters are not knobs to turn by "mood"; they are knobs that hold meaning along the extension of this design.
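
To make the third point concrete: diffusion pipelines usually implement a conditioning strength such as guidance_scale as classifier-free guidance, blending an unconditional noise prediction with an audio-conditioned one. The sketch below assumes LatentSync follows that standard formulation (the sources quoted in this article do not spell it out); the function name and the toy numbers are illustrative only.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the audio-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy noise predictions for two latent values (assumed numbers).
uncond = [0.0, 0.5]
cond = [0.25, 0.25]

for scale in (1.0, 1.5, 3.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

Under this formulation, a scale of 1.0 simply returns the conditioned prediction, while larger values extrapolate beyond it. That extrapolation is one plausible reading of why high guidance_scale values strengthen sync but risk distortion.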


How to Choose a Version: 1.5 vs 1.6

What's currently distributed officially is 1.5 and 1.6, and the code is common. Switching versions is "just load the corresponding checkpoint and change the resolution parameter in the U-Net config" (from the README). 1.6 is just 1.5 with the training data raised to 512×512; the model structure and training strategy are unchanged.

| Item | LatentSync 1.5 | LatentSync 1.6 (latest) |
|---|---|---|
| Published | | 2025/06/11 |
| Training resolution | 256×256-centric | 512×512 |
| Main purpose | Lightweight, low VRAM | Resolving teeth/mouth blur (high definition) |
| Inference VRAM | about 8GB | about 18GB |
| Temporal consistency | Strengthened with a temporal layer | Inherits 1.5 |
| Chinese video | Performance improved in 1.5 | Inherited |
| config (inference) | configs/unet/stage2.yaml-family | configs/unet/stage2_512.yaml |

The choice is simple.

  • Face appears large / quality top priority (the hero cut of a dub, an avatar close-up) → 1.6. Teeth and lip detail pay off.
  • You only have an 8GB-class GPU / want to cut cost on large batches → 1.5. Plenty practical.
  • If unsure, start with 1.6. If you have 18GB or more of VRAM, as on an L40S, A100, or RTX 4090 (24GB), 1.6 is the one to choose.

⚠️ A point easy to conflate: the HuggingFace repository ID is ByteDance/LatentSync-1.6, but both 1.5/1.6 checkpoints are included in it. setup_env.sh fetches 1.6's latentsync_unet.pt. When you want to use 1.5, choose the corresponding weights and config.
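
The selection rule above can be written down as a small helper. This is a minimal sketch: the VRAM thresholds are the approximate inference figures quoted in this article, and the function name and the prefer_quality switch are assumptions made for illustration.

```python
from typing import Optional

# Approximate inference VRAM per version, as quoted in this article.
VRAM_NEEDED_GB = {"1.6": 18, "1.5": 8}


def pick_version(vram_gb: float, prefer_quality: bool = True) -> Optional[str]:
    """Return "1.6", "1.5", or None when even 1.5 would not fit."""
    fits = [v for v, need in VRAM_NEEDED_GB.items() if vram_gb >= need]
    if not fits:
        return None
    if prefer_quality and "1.6" in fits:
        return "1.6"
    return "1.5"


print(pick_version(24))                        # 24GB card, quality first
print(pick_version(12))                        # 12GB card
print(pick_version(24, prefer_quality=False))  # large batches, cost first
```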


Usage A: Just to Try, the Replicate API (No Own GPU Needed)

At the stage of "first checking whether the quality holds up on my material," hitting it via the API without preparing a GPU is the shortest path. Replicate's bytedance/latentsync is a hosted version wrapping the repository's inference script.

  • Hardware: Nvidia L40S
  • Cost: about $0.11 per run (≒ about 9 runs for $1)
  • Time: typically about 108 seconds (varies greatly with input)
  • Input: video mp4, audio mp3 / aac / wav / m4a

The input fields directly reflect the repository's inference arguments (described later), effectively video / audio / guidance_scale / inference_steps / seed.

📌 A note for accuracy: always confirm the latest default values and ranges of each field in Replicate's "API" tab. Since the hosted-version wrapper follows the official repository's updates, there may be differences from the values at the time of writing. This article's code shows the structure; confirm and adjust the values.

TypeScript (Prototype: Synchronous Execution)

First, a minimal setup using replicate, the official Node.js client.

// scripts/lipsync-quickstart.ts
import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const output = await replicate.run("bytedance/latentsync", {
  input: {
    video: "https://example.com/source.mp4",
    audio: "https://example.com/dub_en.wav",
    guidance_scale: 1.5, // 1.0–3.0: higher favors sync (watch for distortion)
    inference_steps: 20, // 20–50: higher is better quality but slower
    seed: 1247,          // fixed for reproducibility; -1 is random each time
  },
});

console.log("generated video url:", String(output));

This is convenient for CLI use and verification because run() blocks until completion. In a web app's request handler, however, it holds a request open for 100+ seconds, which invites timeouts and double execution. For production, use the webhook method that follows.

TypeScript (Production: Async + Idempotency + Webhook)

The requirements to make it production quality are 3 — ① don't block (async), ② don't double-bill on the same input (idempotency), ③ validate external input (type-safe, security). Implement it in a Next.js Route Handler.

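// app/api/lipsync/route.ts
import { NextResponse } from "next/server";
import { createHash } from "node:crypto";
import Replicate from "replicate";
import { z } from "zod";
import { jobStore } from "@/lib/job-store"; // this app's own KV wrapper (not shown; path is illustrative)

// ① Always validate external input at the boundary (the trust boundary is on the server side)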
const RequestSchema = z.object({
  videoUrl: z.string().url(),
  audioUrl: z.string().url(),
  guidanceScale: z.number().min(1).max(3).default(1.5),
  inferenceSteps: z.number().int().min(20).max(50).default(20),
  seed: z.number().int().default(1247),
});
type LipSyncInput = z.infer<typeof RequestSchema>;

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// ② Build a deterministic key from the input: same material + same parameters = same job.
function jobKey(input: LipSyncInput): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

export async function POST(req: Request) {
  const parsed = RequestSchema.safeParse(await req.json());
  if (!parsed.success) {
    // 422: report what was invalid (include no PII)
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 422 });
  }
  const input = parsed.data;
  const key = jobKey(input);

  // Idempotency: if the job already exists, return it instead of recreating it (prevents double billing and double generation)
  const existing = await jobStore.get(key);
  if (existing) {
    return NextResponse.json({ jobId: existing.predictionId, status: existing.status, cached: true });
  }

  // ③ Start asynchronously. Completion arrives via webhook (the request returns immediately)
  const prediction = await replicate.predictions.create({
    model: "bytedance/latentsync",
    input: {
      video: input.videoUrl,
      audio: input.audioUrl,
      guidance_scale: input.guidanceScale,
      inference_steps: input.inferenceSteps,
      seed: input.seed,
    },
    webhook: `${process.env.PUBLIC_BASE_URL}/api/lipsync/webhook`,
    webhook_events_filter: ["completed"],
  });

  await jobStore.put(key, { predictionId: prediction.id, status: prediction.status });
  return NextResponse.json({ jobId: prediction.id, status: prediction.status, cached: false });
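}

// app/api/lipsync/webhook/route.ts
import { NextResponse } from "next/server";
import { jobStore } from "@/lib/job-store"; // this app's own KV wrapper (not shown; path is illustrative)
import { verifyReplicateSignature } from "@/lib/replicate-webhook"; // signature verification (this app's own helper)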

export async function POST(req: Request) {
  const raw = await req.text();
  // Security: always verify the webhook signature. Never trust an unverified body.
  if (!(await verifyReplicateSignature(req.headers, raw))) {
    return NextResponse.json({ error: "invalid signature" }, { status: 401 });
  }
  const event = JSON.parse(raw) as { id: string; status: string; output?: string };

  if (event.status === "succeeded" && event.output) {
    await jobStore.markDone(event.id, event.output); // store the generated URL and notify downstream
  } else if (event.status === "failed" || event.status === "canceled") {
    await jobStore.markFailed(event.id); // record the failure and reflect it in the UI
  }
  return NextResponse.json({ ok: true });
}

The point is that the jobStore (a KV store such as Vercel KV or Upstash Redis suffices) records each job keyed on sha256(input). This gives you idempotency and cost efficiency at once: re-requesting the same video × same audio × same parameters returns the cached job, and rapid clicks on the submit button run it only once, provided the check-and-write is atomic. With a plain get followed by a put, two simultaneous requests can both miss and both start a job; a set-if-not-exists operation closes that gap. Webhook signature verification is mandatory (Replicate sends a signature header): an unverified endpoint is a hole through which anyone can fake "it succeeded."
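
For readers on a Python backend, here is a minimal sketch of what such a signature check can look like. It assumes the scheme Replicate documents for its webhooks: webhook-id, webhook-timestamp, and webhook-signature headers, a signing secret of the form whsec_<base64>, and an HMAC-SHA256 over "id.timestamp.body". Confirm those details against Replicate's current documentation before relying on it. The function name, the lowercase header keys, and the 5-minute tolerance are my own assumptions.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def verify_webhook(headers: Mapping[str, str], body: str, secret: str,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Return True only if `body` was signed with `secret` recently enough."""
    msg_id = headers.get("webhook-id")
    timestamp = headers.get("webhook-timestamp")
    signatures = headers.get("webhook-signature")
    if not (msg_id and timestamp and signatures):
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # stale or future-dated: possibly a replay

    # Assumed secret format: "whsec_" followed by the base64-encoded key.
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed_content = f"{msg_id}.{timestamp}.{body}".encode("utf-8")
    expected = base64.b64encode(
        hmac.new(key, signed_content, hashlib.sha256).digest()
    )

    # The header may carry several space-separated "v1,<signature>" entries.
    for entry in signatures.split():
        version, _, candidate = entry.partition(",")
        if version == "v1" and hmac.compare_digest(candidate.encode("utf-8"), expected):
            return True
    return False
```

Two details matter here regardless of language: verify against the raw request body (not a re-serialized JSON object), and compare signatures in constant time.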

Python / curl

If your server is Python, the official client lets you write the same thing.

# pip install replicate
import replicate

output = replicate.run(
    "bytedance/latentsync",
    input={
        "video": "https://example.com/source.mp4",
        "audio": "https://example.com/dub_en.wav",
        "guidance_scale": 1.5,
        "inference_steps": 20,
        "seed": 1247,
    },
)
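print(output)  # URL of the generated video

# Plain HTTP, for one-off runs from CI or a shell.
# This endpoint addresses the model by owner/name. If the API rejects it for this model,
# POST to /v1/predictions with the "version" ID shown in the model's API tab instead.
curl -s -X POST https://api.replicate.com/v1/models/bytedance/latentsync/predictions \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "video": "https://example.com/source.mp4",
      "audio": "https://example.com/dub_en.wav",
      "guidance_scale": 1.5,
      "inference_steps": 20,
      "seed": 1247
    }
  }'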

Usage B: Self-Hosting (Faithful to the Official Procedure)

If the requirement is large batches, data privacy, or minimal per-unit cost, self-host on your own GPU. The right approach here is to follow the official README / setup_env.sh / requirements.txt as-is; arbitrary version changes are a source of dependency breakage.

1. Environment Setup (the contents of the official setup_env.sh)

The official premise is a conda environment + Python 3.10.13.

# System dependency (Ubuntu family)
sudo apt -y install libgl1

# Create the conda environment (Python pinned to 3.10.13)
conda create -y -n latentsync python=3.10.13
conda activate latentsync

# ffmpeg comes from conda-forge
conda install -y -c conda-forge ffmpeg

# Install the Python dependencies in one go
pip install -r requirements.txt

The main pins in requirements.txt (don't bump them arbitrarily):

torch==2.5.1            # CUDA 12.1 build (--extra-index-url cu121)
torchvision==0.20.1
diffusers==0.32.2
transformers==4.48.0
mediapipe==0.10.11      # face landmarks
insightface==0.7.3      # face detection and alignment
onnxruntime-gpu==1.21.0
DeepCache==0.1.1        # what enable_deepcache uses
gradio==5.24.0
numpy==1.26.4

🔧 A gotcha: torch==2.5.1 fetches the CUDA 12.1 build from --extra-index-url https://download.pytorch.org/whl/cu121. If the driver and CUDA versions don't match, onnxruntime-gpu falls back to CPU and it "works but is painfully slow." Always check the driver with nvidia-smi, and confirm that python -c "import torch; print(torch.cuda.is_available())" prints True.
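
The same check can be bundled into one small report that also covers onnxruntime. This is a sketch rather than an official diagnostic: the function name and report keys are my own. Note that get_available_providers() lists the providers the installed onnxruntime build supports, so a True there is necessary but not sufficient; the CUDA libraries can still fail to load when a session is created.

```python
def gpu_environment_report() -> dict:
    """Collect the GPU-related facts worth checking before a first run.
    A value of None means the library could not be imported."""
    report = {"torch_cuda": None, "torch_cuda_build": None, "onnx_cuda_provider": None}
    try:
        import torch
        report["torch_cuda"] = torch.cuda.is_available()
        report["torch_cuda_build"] = torch.version.cuda  # e.g. "12.1"; None on CPU builds
    except (ImportError, OSError):
        pass
    try:
        import onnxruntime as ort
        # Build-level support only; a session can still fall back to CPU at runtime.
        report["onnx_cuda_provider"] = "CUDAExecutionProvider" in ort.get_available_providers()
    except (ImportError, OSError):
        pass
    return report


if __name__ == "__main__":
    for name, value in gpu_environment_report().items():
        print(f"{name}: {value}")
```

If onnx_cuda_provider is False, the CPU-only onnxruntime package is probably installed alongside or instead of onnxruntime-gpu.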

2. Fetch the Checkpoints

setup_env.sh fetches the weights from HuggingFace. To do it by hand, run the next 2 commands.

huggingface-cli download ByteDance/LatentSync-1.6 whisper/tiny.pt --local-dir checkpoints
huggingface-cli download ByteDance/LatentSync-1.6 latentsync_unet.pt --local-dir checkpoints

The final directory structure becomes this.

./checkpoints/
├── latentsync_unet.pt      # the model itself (U-Net)
└── whisper/
    └── tiny.pt             # Whisper for audio embeddings

3. Inference (the contents of the official inference.sh)

To try it with a GUI, run python gradio_app.py. For batch work and automation, use the CLI (./inference.sh). The official inference.sh defaults to 1.6 (512×512), with the contents below.

#!/bin/bash
python -m scripts.inference \
    --unet_config_path "configs/unet/stage2_512.yaml" \
    --inference_ckpt_path "checkpoints/latentsync_unet.pt" \
    --inference_steps 20 \
    --guidance_scale 1.5 \
    --enable_deepcache \
    --video_path "assets/demo1_video.mp4" \
    --audio_path "assets/demo1_audio.wav" \
    --video_out_path "video_out.mp4"

The full arguments scripts.inference takes (per the argparse definition):

| Argument | Type | Default | Role |
|---|---|---|---|
| --unet_config_path | str | configs/unet.yaml | U-Net config. 1.6 is configs/unet/stage2_512.yaml |
| --inference_ckpt_path | str | required | U-Net checkpoint |
| --video_path | str | required | Input video |
| --audio_path | str | required | The audio to apply |
| --video_out_path | str | required | The output destination |
| --inference_steps | int | 20 | Diffusion iteration count. 20→50 for higher quality, slower |
| --guidance_scale | float | 1.0 | The audio condition's strength. 1.0→3.0 for sync-focus |
| --seed | int | 1247 | Random seed. -1 for random each time |
| --temp_dir | str | temp | Intermediate-file location |
| --enable_deepcache | flag | off | Speed up inference with DeepCache |

💡 inference.sh passes guidance_scale 1.5 while the argparse default is 1.0. The recommendation in real operation is around 1.5 — the basis is that the official demo sets it that way. When directly hitting bare python -m scripts.inference, with 1.0 the sync can feel weak, so state it explicitly.
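
When driving the CLI from a batch job, it helps to build the command in one place so that guidance_scale is always explicit. The sketch below uses only the flags listed in the table above; the function name, the range checks (taken from the ranges this article quotes), and the defaults are my own choices, not part of the repository.

```python
import sys
from typing import List


def build_inference_cmd(video: str, audio: str, out: str, *, steps: int = 20,
                        guidance: float = 1.5, seed: int = 1247, deepcache: bool = True,
                        unet_config: str = "configs/unet/stage2_512.yaml",
                        ckpt: str = "checkpoints/latentsync_unet.pt") -> List[str]:
    """Build the argv for `python -m scripts.inference` with explicit parameters."""
    if not 20 <= steps <= 50:
        raise ValueError(f"inference_steps out of range 20-50: {steps}")
    if not 1.0 <= guidance <= 3.0:
        raise ValueError(f"guidance_scale out of range 1.0-3.0: {guidance}")
    cmd = [
        sys.executable, "-m", "scripts.inference",
        "--unet_config_path", unet_config,
        "--inference_ckpt_path", ckpt,
        "--inference_steps", str(steps),
        "--guidance_scale", str(guidance),  # always explicit; the argparse default is 1.0
        "--seed", str(seed),
        "--video_path", video,
        "--audio_path", audio,
        "--video_out_path", out,
    ]
    if deepcache:
        cmd.append("--enable_deepcache")
    return cmd


print(" ".join(build_inference_cmd("in.mp4", "dub.wav", "out.mp4")))
```

To execute it, pass the list to subprocess.run(cmd, check=True, cwd=...) with cwd set to your LatentSync checkout, since scripts.inference is resolved relative to the repository root.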

Internally, it runs with float16 if the GPU's compute capability is 8.0 or above (Ampere/Ada generation, e.g. A100, L40S, RTX 30/40), and float32 if below (V100, T4, etc.). On an old GPU, "slower than expected / eats VRAM" is due to this branch.
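
The precision branch described above can be expressed as a one-line rule. This is an illustrative sketch of that rule, not the repository's code; the function name is hypothetical, and the compute capabilities in the table are NVIDIA's published values for those GPUs.

```python
from typing import Dict, Tuple


def dtype_for_capability(major: int) -> str:
    """float16 on compute capability 8.0 and above, float32 below."""
    return "float16" if major >= 8 else "float32"


# (major, minor) compute capability of the GPUs mentioned in this article.
KNOWN_GPUS: Dict[str, Tuple[int, int]] = {
    "A100": (8, 0),
    "L40S": (8, 9),
    "RTX 4090": (8, 9),
    "T4": (7, 5),
    "V100": (7, 0),
}

for name, (major, minor) in KNOWN_GPUS.items():
    print(f"{name} (sm_{major}{minor}): {dtype_for_capability(major)}")
```

On a live machine, torch.cuda.get_device_capability() returns the same (major, minor) tuple for the current device, so you can predict which branch a given host will take before launching a long job.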


Practical Parameter Tuning: Per-Use Presets

The official docs show only the ranges (inference_steps [20-50] / guidance_scale [1.0-3.0]). Here I add the values that work in practice. The table below is a ready-to-use set of presets.

| Use | inference_steps | guidance_scale | enable_deepcache | Version | Aim |
|---|---|---|---|---|---|
| Draft check (internal review) | 20 | 1.5 | ON | 1.5 | See the shape fast and cheap |
| Standard dub (YouTube/PR) | 25–30 | 1.5–2.0 | | 1.6 | The best point of quality and speed |
| Close-up hero cut (face large) | 35–50 | 1.5 | OFF | 1.6 | Teeth/lip detail top priority |
| Material with weak sync (fast talk/song) | 30 | 2.0–2.5 | | 1.6 | Strengthen the mouth's following |
| Distortion/jitter appeared | +10 steps | lower it (→1.5) | | 1.6 | Revert the over-raised guidance |

Rephrasing the meaning of the knobs in practical terms:

  • inference_steps (diffusion iterations): higher is cleaner but slower, roughly linearly. Going 20→40 improves quality but doubles the time. The most cost-effective approach is to check everything at 20 first, then rerun only the adopted cuts at high steps.
  • guidance_scale (the audio condition's strength): raise it and the mouth follows the audio more closely, but raise it too far and the face distorts or jitters. The official docs also state plainly "accuracy rises but with possible distortion/jitter." Start from 1.5, and if sync is insufficient, raise it in +0.5 increments. Revert if it distorts.
  • seed: fix it and the result is reproducible. If the picture changes on every run, A/B comparisons and redos can't be verified, so always fix it in production (the default 1247 is fine). Change it only when you want a new variation.
  • enable_deepcache: a speed-up flag backed by DeepCache. In practice, turn it ON for drafts and OFF for the final close-up to avoid a slight quality difference.

A principle that maps directly to cost: quality costs time, and time is money, roughly in proportion to inference_steps. Running every cut at high steps is wasteful. The two-tier flow of "review drafts at 20 → regenerate only the approved cuts at high steps" does the most for your production unit cost.
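
A quick back-of-the-envelope calculation shows when the two-tier flow pays off. The sketch assumes cost is exactly proportional to inference_steps (the article only says "roughly"), and the cut counts are made-up example numbers.

```python
def step_cost(total_cuts: int, adopted_cuts: int,
              draft_steps: int = 20, final_steps: int = 40) -> dict:
    """Compare total diffusion steps: draft-then-final versus everything at final quality."""
    two_tier = total_cuts * draft_steps + adopted_cuts * final_steps
    single_pass = total_cuts * final_steps
    return {
        "two_tier": two_tier,
        "single_pass": single_pass,
        "saving_ratio": 1 - two_tier / single_pass,
    }


# 100 cuts drafted, 30 adopted for the final render.
print(step_cost(100, 30))
# 100 cuts drafted, 60 adopted: the draft pass no longer pays for itself.
print(step_cost(100, 60))
```

With 30 of 100 cuts adopted, the two-tier flow spends 3,200 step-units against 4,000 for a single high-quality pass, a 20% saving. The saving disappears once the adopted share exceeds 1 - draft_steps / final_steps (50% for 20 versus 40 steps), so the flow is worth it mainly when review rejects or reworks a good share of the drafts.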


5 Pitfalls That Always Bite in Production, and Resilience Design

These are problems that don't happen in a demo (10 seconds, frontal, no silence) but erupt all at once with real material (tens of minutes, profile views, silence, multiple speakers). I present each cause and countermeasure in the order I hit them in real operation. As the mechanism chapter showed, they follow from LatentSync's design.

① Face-Detection / Face-Alignment Failure (Most Frequent)

LatentSync internally detects and aligns the face (mediapipe / insightface) before making the mouth. With profile views, downcast faces, hand/mic occlusion, or extreme size, if detection misses, the mouth collapses or the processing falls over.

Countermeasures:

  • Before feeding material in, check yourself whether a face is present and near-frontal, and for cuts where detection fails, skip lip-sync and pass the source through (pass-through). Choosing not to force-sync every cut protects quality.
  • If multiple faces appear in one frame, identify the speaker and crop before feeding it in.

# Pre-flight guard: send only cuts with exactly one detected face to LatentSync
# (frontality itself is not measured here)
import mediapipe as mp

_detector = mp.solutions.face_detection.FaceDetection(model_selection=1, min_detection_confidence=0.6)

def is_lipsyncable(frame_rgb) -> bool:
    """True only when exactly one face is detected with sufficient confidence. On False, pass the source through."""
    result = _detector.process(frame_rgb)
    faces = result.detections or []
    return len(faces) == 1  # exclude multi-face and no-face cuts from sync to prevent breakage
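
A single frame is a weak basis for deciding the fate of a whole cut, so one option is to sample several frames and vote. The sketch below assumes you have already collected a per-frame face count for the cut (for example, by counting detections on sampled frames); the function name and the 90% threshold are illustrative assumptions to calibrate on your own material.

```python
from typing import Sequence


def cut_is_lipsyncable(face_counts: Sequence[int], min_ratio: float = 0.9) -> bool:
    """Send a cut to lip-sync only if most sampled frames show exactly one face."""
    if not face_counts:
        return False  # nothing sampled: pass the source through
    single_face_frames = sum(1 for n in face_counts if n == 1)
    return single_face_frames / len(face_counts) >= min_ratio


# One face in 9 of 10 sampled frames: accept.
print(cut_is_lipsyncable([1, 1, 1, 1, 1, 1, 1, 1, 1, 0]))
# Faces come and go: treat the cut as pass-through.
print(cut_is_lipsyncable([1, 2, 1, 0]))
```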

② VRAM Exhaustion (OOM) on Long Video

A diffusion model eats VRAM in proportion to frame count. Pass a tens-of-minutes video whole in one go and even 18GB falls over with CUDA out of memory.

Countermeasure: split the video into segments, process them sequentially, and concatenate the results. In addition, silent intervals need no mouth movement, so using voice-activity detection (VAD) to keep silence away from LatentSync improves quality (no hallucinated mouth flapping during silence) and cost at the same time. On my platform, this VAD pass-through cut lip-sync GPU-processing cost by about 40% (detailed in the pipeline-design article).

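# Treat OOM as a normal case: on failure, narrow the window and retry (resilience).
# run_latentsync, concat, and the segment object are this pipeline's own helpers (not shown).
import torch

def lipsync_segment(seg, *, max_frames: int = 750):
    if seg.frame_count > max_frames:
        # Over the frame budget: split before even trying
        a, b = seg.split_in_half()
        return concat(lipsync_segment(a, max_frames=max_frames), lipsync_segment(b, max_frames=max_frames))
    try:
        return run_latentsync(seg)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        if seg.frame_count <= 1:
            raise  # cannot split further = a real fault; do not swallow it
        a, b = seg.split_in_half()      # halve to keep the fallback local
        return concat(lipsync_segment(a, max_frames=max_frames), lipsync_segment(b, max_frames=max_frames))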
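
The silence-skipping half of the countermeasure can be sketched as follows. This uses a plain energy threshold as a stand-in for a real VAD, so treat it as an illustration of the interval bookkeeping rather than of speech detection itself. The function name, the per-frame energy input, and the default frame length and gap are all assumptions.

```python
from typing import List, Sequence, Tuple


def speech_intervals(energies: Sequence[float], threshold: float,
                     frame_s: float = 0.02, min_gap_s: float = 0.3) -> List[Tuple[float, float]]:
    """Turn per-frame audio energies into (start, end) speech intervals in seconds,
    merging intervals separated by a pause shorter than min_gap_s."""
    raw: List[List[float]] = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                      # speech begins
        elif energy < threshold and start is not None:
            raw.append([start * frame_s, i * frame_s])
            start = None                   # speech ends
    if start is not None:
        raw.append([start * frame_s, len(energies) * frame_s])

    merged: List[List[float]] = []
    for begin, end in raw:
        if merged and begin - merged[-1][1] < min_gap_s:
            merged[-1][1] = end            # short pause: keep it inside one segment
        else:
            merged.append([begin, end])
    return [(begin, end) for begin, end in merged]


# 1-second frames for readability: a short pause is bridged, a long one is not.
print(speech_intervals([0, 5, 5, 0, 5, 0, 0, 0, 5], threshold=1, frame_s=1.0, min_gap_s=2.0))
```

Only the returned intervals would be sent to lip-sync; everything outside them keeps the source frames. Merging short pauses avoids cutting a sentence into many tiny segments, each of which would pay the model's per-run overhead.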

③ fps / Resolution / Audio-Format Mismatch

If the input fps is inconsistent or the audio is an unexpected codec, the mouth and audio drift overall or processing fails.

Countermeasure: insert one stage of normalizing with ffmpeg before feeding it in. Just fixing the fps and aligning the audio to 16kHz/wav makes reproducibility and stability jump.

# Pre-flight normalization: fix the fps and unify audio as 16kHz wav
ffmpeg -y -i input.mp4 -r 25 -c:v libx264 -pix_fmt yuv420p norm_video.mp4
ffmpeg -y -i dub.m4a -ar 16000 -ac 1 dub.wav
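
To avoid re-encoding material that is already clean, you can inspect the stream first and normalize only when needed. The sketch below parses the JSON that ffprobe prints for `ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate,avg_frame_rate -of json input.mp4`. The function name and the 25 fps target (matching the ffmpeg command above) are my own choices, and treating a mismatch between the two rates as "variable frame rate" is a heuristic, not a guarantee.

```python
import json
from fractions import Fraction


def needs_normalization(ffprobe_json: str, target_fps: int = 25) -> bool:
    """True when the first video stream is not a constant target_fps stream."""
    stream = json.loads(ffprobe_json)["streams"][0]
    nominal = Fraction(stream["r_frame_rate"])    # e.g. "30000/1001"
    average = Fraction(stream["avg_frame_rate"])
    # A nominal/average mismatch hints at variable frame rate.
    return nominal != average or nominal != target_fps


clean = '{"streams": [{"r_frame_rate": "25/1", "avg_frame_rate": "25/1"}]}'
ntsc = '{"streams": [{"r_frame_rate": "30000/1001", "avg_frame_rate": "30000/1001"}]}'
print(needs_normalization(clean))
print(needs_normalization(ntsc))
```

Using Fraction keeps rates like 30000/1001 exact, so the comparison does not depend on floating-point rounding.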

④ Teeth/Mouth Blur

Teeth blur in 1.5 — this was a weakness the official docs recognized too, and 1.6 resolves it with 512×512 training.

Countermeasure: for material where the face appears large, first 1.6. If it still bothers you, raise inference_steps. Continuing to use 1.5 for close-up-heavy content is the same as taking on an already-solved problem.

⑤ Checking Quality Only by "Eye"

The most dangerous thing in production is the quality gate being only human eyeballing. In large batches, broken cuts will surely slip through.

Countermeasure: mechanically score the sync quality with the evaluation scripts bundled in the LatentSync repository, and send only the cuts below a threshold to human review or regeneration. Making the SyncNet confidence a quantitative gate drastically cuts review effort.

# Machine-score the audio-mouth sync of generated output (wire into CI or post-processing)
./eval/eval_sync_conf.sh     # SyncNet confidence
./eval/eval_syncnet_acc.sh   # SyncNet accuracy
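
Once each cut has a score, the gate itself is a few lines. The sketch assumes you have already collected one SyncNet confidence value per cut into a dictionary (how you parse the evaluation script's output is not covered here). The function name and the threshold value are hypothetical; calibrate the threshold on cuts you have judged by eye.

```python
from typing import Dict, List


def route_by_confidence(scores: Dict[str, float], threshold: float) -> Dict[str, List[str]]:
    """Split cuts into those that pass automatically and those needing review or regeneration."""
    routed: Dict[str, List[str]] = {"pass": [], "review": []}
    for cut_id, confidence in sorted(scores.items()):
        bucket = "pass" if confidence >= threshold else "review"
        routed[bucket].append(cut_id)
    return routed


# Example scores (made-up numbers) with an assumed threshold of 6.0.
scores = {"cut_001": 8.2, "cut_002": 3.1, "cut_003": 6.0}
print(route_by_confidence(scores, threshold=6.0))
```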

Production-Operation Design Principles (Observability, Idempotency, Resilience, Cost)

Let me re-summarize the pitfall countermeasures in the language of operational quality. This is the difference between "works" and "doesn't fall over in production."

  • Idempotency: make sha256(video + audio + steps + guidance + seed) the job key and cache the result. Don't double-generate on resend, rapid clicks, or retries. Same design for Replicate and self-hosted GPU.
  • Resilience: treat OOM, face-detection failure, and GPU preemption not as exceptions but as normal cases. With "shrink the window and retry" and "pass through the source for un-syncable cuts," prevent one cut's failure from dragging the whole down.
  • Observability: leave in structured logs, per cut, which version, how many steps, what guidance, what SyncNet confidence. Make it a state where you can later trace "why did only this cut's mouth collapse." Don't emit PII (face, audio content) to logs.
  • Cost efficiency: carve the unit cost with 3 tiers — ① pass through silence with VAD, ② draft 20 steps → only adopted cuts at high steps, ③ zero regeneration via the idempotency cache. With Replicate, $0.11 per run accumulates directly, so reducing wasted shots works directly.
  • A feel for self-hosting unit cost: keeping an 18GB-class spot or on-demand GPU saturated with batches tends to make the per-unit cost lower than Replicate. Once you add environment setup, operations, and GPU procurement, though, the rule of thumb for break-even is API for small volume, self-hosting for steady large volume.
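
The idempotency principle can be sketched in Python as a canonical job key plus an atomic "claim" step. The in-memory store below is a stand-in for illustration only; in production the same claim would be a set-if-not-exists operation on your KV store. All class and function names are my own.

```python
import hashlib
import json
import threading
from typing import Dict


def job_key(params: Dict[str, object]) -> str:
    """Deterministic key: the same parameters always hash the same, whatever the dict order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryJobStore:
    """Illustrative stand-in for a KV store with an atomic set-if-absent."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def claim(self, key: str) -> bool:
        """Return True for the first caller only; later callers must reuse the existing job."""
        with self._lock:
            if key in self._jobs:
                return False
            self._jobs[key] = {"status": "starting"}
            return True


store = InMemoryJobStore()
key = job_key({"video": "https://example.com/source.mp4", "audio": "https://example.com/dub_en.wav",
               "steps": 20, "guidance": 1.5, "seed": 1247})
print(store.claim(key))  # first request starts the job
print(store.claim(key))  # a rapid second click does not
```

Hashing URLs only identifies the request, not the bytes behind it. If the file at a URL can change, hash the file contents (or a version identifier) instead so that stale results are not served from the cache.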

Comparison with Other Lip-Sync Models

A summary to answer "so which one do I use, after all?" The right answer is to choose by use case; no model is all-purpose.

| Model | Method | Strengths | Weaknesses | License | Suited scene |
|---|---|---|---|---|---|
| LatentSync | Diffusion (audio-conditioned LDM) + SyncNet/TREPA | Overall strength in naturalness, temporal consistency, sync accuracy | Needs a fair amount of VRAM, weak on profile views | Apache-2.0 | High-quality dubbing, avatars, training materials |
| Wav2Lip | GAN-family (lightweight) | Light, fast, runs on low spec | Low resolution / mouth detail | Research-use-centric (confirm) | Prototype, low-load mass processing |
| SadTalker | 3DMM + face generation | Can talk from 1 still image | Tends to lag video-based naturalness | Confirm | Talking head from a photo |
| MuseTalk | Real-time-oriented | Fast, near-real-time | In some scenes, sync stability is a step behind the diffusion family | Confirm | Live-leaning, low-latency requirements |

In practice, the decision goes roughly like this.

  • Quality is paramount, and you want to clear commercial use under Apache-2.0 → LatentSync.
  • You only have 1 still image → SadTalker.
  • Low-latency, near-real-time is the requirement → MuseTalk.
  • Lightweight and in bulk → a two-tier flow also works: drafts with Wav2Lip, final generation of the adopted cuts with LatentSync.

Each model's license gets updated, so always confirm commercial use with the primary source. LatentSync is Apache-2.0 (official repository) at the time of writing.


Frequently Asked Questions (FAQ)

Q. Can I use it commercially? A. LatentSync is Apache-2.0 licensed in the official repository. It supports commercial use, but the portrait rights of the people in the output and the rights to the audio are a separate matter. Making a real person speak without their consent carries large legal/ethical risk, so use it with consented material.

Q. Can I use it with Japanese audio? A. You can. Audio is conditioned via Whisper's embedding, generating mouth movement language-independently. With the background that Chinese-video performance improved in 1.5, it's practical for non-English too.

Q. What GPU / VRAM is needed? A. For inference, 1.6 is about 18GB, 1.5 is about 8GB. 1.6 is L40S/A100/RTX 4090(24GB) class, and 1.5 runs even on 8–16GB class. Training is a different matter: 512×512 Stage 2 needs 55GB (normally training is unneeded, inference alone suffices).

Q. Can I use it for live streaming? A. Unsuited. It's diffusion-based and takes seconds to minutes per run (typically 108 seconds on Replicate). For live use, consider a low-latency model like MuseTalk.

Q. Can I match the mouth even on profile views? A. It's weak. Internal face alignment is the premise, and the closer to frontal, the more stable. For cuts where a profile view continues, not syncing and passing through the source is the realistic solution that protects quality.

Q. What's the minimum/maximum video length? A. There's no explicit upper bound, but long video requires segment splitting due to VRAM constraints. In practice, assemble on the premise of "segment splitting + silence skipping + idempotency cache."

Q. Which is cheaper, own GPU or the API? A. API for small volume (zero environment setup, $0.11 per run). For steady large volume, self-hosting that keeps a spot GPU saturated with batches tends to win on per-unit cost. The standard path is to first verify quality and demand with the API, and move to self-hosting once volume grows.
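
A rough break-even figure follows from the two Replicate numbers quoted in this article. The sketch assumes a self-hosted run takes the same 108 seconds as on Replicate's L40S and ignores setup and operations cost, so read the result as an upper bound on what a GPU hour may cost, not as a quote. The function name and the utilization parameter are illustrative.

```python
def break_even_gpu_price_per_hour(api_price_per_run: float = 0.11,
                                  seconds_per_run: float = 108.0,
                                  utilization: float = 1.0) -> float:
    """Hourly GPU price below which self-hosting beats the API on per-run cost.
    utilization is the fraction of each hour the GPU actually spends on jobs."""
    runs_per_hour = 3600.0 / seconds_per_run * utilization
    return runs_per_hour * api_price_per_run


print(round(break_even_gpu_price_per_hour(), 2))                  # GPU busy all hour
print(round(break_even_gpu_price_per_hour(utilization=0.25), 2))  # GPU busy a quarter of the time
```

At full utilization, a GPU does about 33 runs per hour, worth about $3.67 of API spend; at 25% utilization the break-even price drops to about $0.92 per hour. This is why the advice hinges on steady volume: an idle self-hosted GPU erodes the advantage quickly.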


Summary: Moving LatentSync from "Works" to "Earns in Production"

The essence of LatentSync lies in achieving both "diffusion's naturalness" and "accurate audio-mouth sync" with SyncNet supervision and TREPA. That's exactly why its quality is high, and exactly why you need to understand and handle the design-rooted knobs of face alignment, iteration count, and condition strength.

The path to implementation is simple.

  1. First check the quality of your own material with the Replicate API (no own GPU, $0.11 per run).
  2. If it looks promising, self-host 1.6 and refine inference_steps / guidance_scale with the per-use presets.
  3. Production weaves idempotency, resilience, observability, and cost into the design — face-detection guard, segment splitting, silence skipping with VAD, the SyncNet-confidence gate.

Only at this point does it become not a demo but a product that "doesn't fall over on the customer's material." And the point I most want to convey: this chain of design judgments is exactly where the choice of contractor makes a difference. Anyone can build a "just wire up the model" demo, but for a platform that doesn't break on real material (long video, profile views, silence, multiple speakers), the number of problems you have already hit translates directly into quality.

I implemented this article's pitfalls and resilience design in an AI video-localization platform actually running in production. If you're considering building or improving a video AI pipeline that includes lip-sync, see my portfolio and feel free to consult me. Working solo with generative AI, I build everything from PoC to production operation: fast, cheap, and safe.


Sources / Official Resources

  • Versions, parameters, pricing, and licenses get updated. Always confirm the primary source before implementing. This article's numbers (SyncNet 91%→94%, inference VRAM 8GB/18GB, 512×512, about $0.11 and 108 seconds on L40S, etc.) are based on the official information at the time of writing.

友田 陽大

Developer of a METI Minister's Award–winning product. With TypeScript + Python + AWS, I deliver SaaS, industry DX, and production-grade generative AI (RAG) end to end — from requirements to infrastructure and operations — single-handedly.
