# LatentSync Complete Guide: Running ByteDance's Diffusion Lip-Sync Model in Production, Faithful to the Official Docs

> A guide to LatentSync, ByteDance's audio-conditioned latent-diffusion lip-sync model, that stays faithful to the official documentation (GitHub, paper, HuggingFace). It covers how the latest release (1.6) works, both the Replicate API and self-hosting procedures, tuning of `inference_steps` and `guidance_scale`, and resilience design against face-detection failure, OOM, and audio drift, with concrete code for what production requires.

- Published: 2026-06-24
- Author: 友田 陽大
- Tags: LatentSync, lip-sync, AI video, diffusion models, Python, GPU, generative AI, MLOps
- URL: https://tomodahinata.com/en/blog/latentsync-lip-sync-diffusion-model-production-guide
- Category: Lip-sync & digital humans
- Pillar guide: https://tomodahinata.com/en/blog/ai-lip-sync-talking-head-model-selection-guide-2026

## Key points

- LatentSync is ByteDance's audio-conditioned latent diffusion model. SyncNet supervision and TREPA remove the unnatural "only the mouth moves" look, and the paper reports SyncNet accuracy on HDTF improving from 91% to 94%
- The latest release, 1.6, fixes teeth and mouth blur by training at 512×512. Inference needs about 18GB of VRAM (1.5 needs about 8GB). The model structure is unchanged from 1.5, so switching only means loading the other checkpoint and its U-Net config
- To just try it, use the Replicate API (L40S, about $0.11 per run, about 108 seconds); to build it into a product, self-host (conda + PyTorch 2.5.1, Apache-2.0). Think of it as a choice between these two
- Control quality with `inference_steps` [20-50] and `guidance_scale` [1.0-3.0], and speed things up with `enable_deepcache`. Ready-to-use presets per use case are provided
- In production, face-detection failures, OOM on long videos, and fps/audio-format mismatches will occur. Build in resilience with segment splitting, an idempotency cache, and a SyncNet-confidence gate

---

## The Goal of This Article

LatentSync is a **lip-sync diffusion model that generates mouth movement from audio**, published by ByteDance. The GitHub repository's tagline states its philosophy directly: "**Taming Stable Diffusion for Lip Sync!**"

This article stays **strictly grounded in the official documentation ([GitHub](https://github.com/bytedance/LatentSync) / [paper](https://arxiv.org/abs/2412.09262) / [HuggingFace](https://huggingface.co/ByteDance/LatentSync-1.6))** while filling in, with code that actually runs, what the official README does not cover: "**which scenarios it fits, how to use it, and where it gets stuck.**" By the end, you should be able to do the following three things.

1. Explain to others **what kind of model LatentSync is, and why its quality is high.**
2. Judge whether to choose the **Replicate API (no GPU of your own needed)** or **self-hosting**, and **get started today.**
3. Assemble a **resilient implementation** that withstands **production** rather than a demo: face-detection failure, OOM on long videos, and audio drift.

> **About the author (disclosure for credibility)**: I single-handedly designed, implemented, and **run in production** an **AI video-localization platform** that, from a single video upload, fully automates "audio separation → transcription → translation → multilingual dubbing → lip-sync." The implementation behind its fifth stage (lip-sync) moved from the Wav2Lip family to MuseTalk and then to **LatentSync.** The "pitfalls" and "resilience design" in this article are not demo knowledge but a record of the mines I stepped on in real operation. The overall pipeline design is covered in a [separate article](/blog/production-ai-video-localization-lipsync-gpu-pipeline), and the project overview is in the [portfolio link](/case-studies/ai-video-localization-lipsync) at the end of this article.

---

## A 30-Second Summary (Conclusion First)

| Viewpoint | Conclusion |
| --- | --- |
| **What kind of model** | A diffusion model that, from one talking-head video plus arbitrary audio, generates "**a video whose mouth is regenerated as if speaking that audio.**" |
| **What's impressive** | SyncNet supervision + TREPA deliver lip movement that matches the audio yet looks natural. The paper reports it surpasses the then-SOTA on HDTF/VoxCeleb2 |
| **Latest version** | **LatentSync 1.6** (published 2025/06/11). 512×512 training reduces teeth/mouth blur. Inference needs **about 18GB of VRAM** |
| **Lightweight version** | **LatentSync 1.5.** Inference runs in **about 8GB of VRAM.** Choose this when cost is the top priority |
| **Just to try** | **Replicate API**: no GPU of your own, **about $0.11 per run**, L40S, about 108 seconds. The fastest route to a prototype |
| **To build out** | **Self-hosting**: conda + PyTorch 2.5.1. **Apache-2.0**, so commercial use is permitted. Choose this for large batch processing |
| **The 2 big quality knobs** | `inference_steps` (20→50 for higher quality, slower) / `guidance_scale` (1.0→3.0 for sync-focus, distortion risk) |
| **Suited uses** | Video-localization dubbing, AI anchor/avatar, e-learning, personalized video |
| **Unsuited uses** | Profile views / heavy motion / material with multiple speakers on screen at once, zero-latency live streaming |

If you want to "**first check the quality on your own material**," jump straight to the [Replicate API chapter](#usage-a-just-to-try-the-replicate-api-no-own-gpu-needed). You can have a result in about 3 minutes.

---

## What Kind of Model LatentSync Is

There are two inputs: **a video of a person talking** and **the audio you want them to speak.** There is one output: **a video in which only the mouth is regenerated to sync with that audio.** Face direction, expression, background, and camerawork stay as they are in the source video; only the mouth shape is generated to match the new audio.

This is useful in scenarios such as the following.

- **Video localization (dubbing)**: replace the narration of an explainer video shot in Japanese with English, and **the mouth looks like it is speaking English.** This is more immersive than subtitles, which can raise conversion rates in overseas markets.
- **AI anchor / virtual human**: apply audio for a script that changes daily to anchor footage shot once, producing **daily news with zero re-shoots.**
- **E-learning / in-house training**: shoot the instructor once, swap the audio per chapter, and **update materials quickly.** No re-shoot is needed each time the script is revised.
- **Personalized marketing video**: feed audio containing the recipient's name and the product name into one template video, so that **"Hello, Mr. ○○" becomes a different video for each person.**

Conversely, material where **a profile view persists**, **the mouth is hidden by a hand or mic**, **multiple faces appear in one frame**, or **cuts change rapidly** easily breaks the internal face detection and alignment, and quality drops. The [pitfalls](#5-pitfalls-that-always-bite-in-production-and-resilience-design) section later shows concrete workarounds.

---

## The Mechanism: Why "Diffusion × SyncNet × TREPA" (Made Gentle, Faithful to the Paper)

This chapter breaks down the core of the official paper **in plain terms while keeping it accurate.** If you only want the implementation, jump to the [next chapter](#how-to-choose-a-version-15-vs-16). That said, understanding this part explains why the pitfalls take the shape they do.

### The Starting Point: Use a Diffusion Model As-Is and "the Mouth Doesn't Match"

LatentSync is built on a **Stable Diffusion-family Latent Diffusion Model (LDM).** Rather than manipulating pixels directly, it **denoises in a latent space** compressed by a VAE, inheriting SD's advantage of staying computationally light even at high resolution.

The official structure is this (from the README).

- Audio is converted from a mel-spectrogram to an **audio embedding** by **Whisper.**
- That audio embedding is injected into the generation process through **the U-Net's cross-attention layers.**
- A **reference frame and a masked frame** are **channel-wise concatenated** with the noised latent and input to the U-Net.

However, the paper ([arXiv:2412.09262](https://arxiv.org/abs/2412.09262)) points out that **"naively applying an audio-conditioned LDM gives insufficient lip-sync accuracy."** The cause is **shortcut learning**: the model takes the easy path and learns mainly **the visual correlation between adjacent frames** rather than **the relationship between audio and mouth.** The result is footage that looks clean but whose audio and mouth are out of sync.

### Solution 1: Force "Audio-Mouth Match" with SyncNet Supervision

So LatentSync **incorporates SyncNet** (a network that scores how well audio and mouth are synchronized) **as a supervision signal during training.** SyncNet scores whether the generated mouth matches the audio, and back-propagating that loss **explicitly forces the model to match audio and mouth.** In addition, to stabilize convergence, the paper introduces an architecture called **StableSyncNet.**

As a quantitative result, the paper reports **the SyncNet accuracy on the HDTF test set improved 91% → 94%.** This is a direct metric of "how accurately the mouth follows the audio."

### Solution 2: With TREPA, Make It "Not Jitter in Time"

Even when each frame is generated cleanly, **continuous playback can flicker or make the teeth waver**, a typical weakness of diffusion-based lip-sync. With **TREPA (Temporal REPresentation Alignment)**, LatentSync **aligns the temporal representation** of the generated frame sequence, improving temporal consistency. During training, the **TREPA, LPIPS, and SyncNet losses are applied in pixel space.**

### These 3 Points Decide "the Shape of the Pitfalls"

Once the mechanism is clear, the problems described later turn out to be **inevitable.**

- Because face alignment is a prerequisite, the model is weak on **profile views, occlusion, and multiple faces.**
- The number of diffusion iterations (`inference_steps`) directly trades quality against speed.
- Push the audio-condition strength (`guidance_scale`) too high and the audio condition overpowers the visual context, producing **distortion and jitter.**

In other words, the parameters are not knobs to turn on a whim; each one **has a meaning that follows from this design.**
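
To make the `guidance_scale` trade-off concrete, here is a minimal sketch of classifier-free guidance, the usual way diffusion pipelines apply a condition strength. It assumes the standard formulation and is not copied from LatentSync's pipeline code; `guided_noise` and the toy vectors are hypothetical.

```python
# Classifier-free guidance on toy vectors (assumed standard formulation,
# not taken from the LatentSync source).

def guided_noise(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Extrapolate from the unconditional prediction toward the audio-conditioned one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]  # toy noise prediction without the audio condition
cond = [1.0, 3.0]    # toy noise prediction with the audio condition

print(guided_noise(uncond, cond, 1.0))  # equals the conditional prediction
print(guided_noise(uncond, cond, 3.0))  # pushed well past the conditional prediction
```

At a scale of 1.0 the result is simply the conditional prediction; larger values extrapolate beyond it, which is one way to picture why over-raising the knob can distort the face.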

---

## How to Choose a Version: 1.5 vs 1.6

Two versions are officially distributed at present, **1.5** and **1.6**, and they **share the same code.** To switch versions, "**just load the corresponding checkpoint and change the resolution parameter in the U-Net config**" (from the README). 1.6 is **1.5 trained on 512×512 data; the model structure and training strategy are unchanged.**

| Item | LatentSync 1.5 | LatentSync 1.6 (latest) |
| --- | --- | --- |
| Published | — | **2025/06/11** |
| Training resolution | 256×256-centric | **512×512** |
| Main purpose | Lightweight, low VRAM | **Resolving teeth/mouth blur** (high definition) |
| Inference VRAM | **about 8GB** | **about 18GB** |
| Temporal consistency | Strengthened with a temporal layer | Inherits 1.5 |
| Chinese video | Performance improved in 1.5 | Inherited |
| config (inference) | `configs/unet/stage2.yaml`-family | `configs/unet/stage2_512.yaml` |

**The choice is simple.**

- **The face fills the frame / quality is the top priority** (the hero cut of a dub, an avatar close-up) → **1.6.** The extra teeth and lip detail pays off.
- **You only have an 8GB-class GPU / you want to cut cost on large batches** → **1.5.** It is perfectly practical.
- If unsure, **start with 1.6.** If you have 18GB or more available, as on an L40S, A100, or RTX 4090 (24GB), 1.6 is the obvious choice.
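
The decision rule above can be sketched as a small helper. This is an illustration only: `choose_version` is a hypothetical function, and the thresholds are the approximate inference VRAM figures quoted in this article, not values read from the repository.

```python
from typing import Optional

# Approximate inference VRAM requirements quoted in this article (GB).
VRAM_REQUIRED_GB = {"1.6": 18, "1.5": 8}

def choose_version(available_vram_gb: float) -> Optional[str]:
    """Return the highest-quality version that fits, or None if neither fits."""
    for version in ("1.6", "1.5"):  # prefer 1.6 when it fits
        if available_vram_gb >= VRAM_REQUIRED_GB[version]:
            return version
    return None

for vram in (24, 12, 6):
    print(vram, "GB ->", choose_version(vram))
```

On a real machine the available memory could be read with `torch.cuda.get_device_properties(0).total_memory`; leave some headroom, since the figures are approximate.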

> ⚠️ **A point easy to conflate**: the HuggingFace repository ID is `ByteDance/LatentSync-1.6`, but **both 1.5/1.6 checkpoints** are included in it. `setup_env.sh` fetches 1.6's `latentsync_unet.pt`. When you want to use 1.5, choose the corresponding weights and config.

---

## Usage A: Just to Try, the Replicate API (No Own GPU Needed)

At the stage of "first checking whether the quality holds up on my material," **hitting it via the API without preparing a GPU** is the shortest path. Replicate's [`bytedance/latentsync`](https://replicate.com/bytedance/latentsync) is a hosted version wrapping the repository's inference script.

- **Hardware**: Nvidia **L40S**
- **Cost**: **about $0.11 per run** (≒ about 9 runs for $1)
- **Time**: typically **about 108 seconds** (varies greatly with input)
- **Input**: video `mp4`, audio `mp3 / aac / wav / m4a`

The input fields directly reflect the repository's inference arguments (described later), effectively `video` / `audio` / `guidance_scale` / `inference_steps` / `seed`.

> 📌 **A note for accuracy**: **always confirm the latest default values and ranges of each field in Replicate's "API" tab.** Since the hosted-version wrapper follows the official repository's updates, there may be differences from the values at the time of writing. This article's code shows the **structure**; confirm and adjust the values.

### TypeScript (Prototype: Synchronous Execution)

First, a minimal setup. Use Node's official client `replicate`.

```ts
// scripts/lipsync-quickstart.ts
import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const output = await replicate.run("bytedance/latentsync", {
  input: {
    video: "https://example.com/source.mp4",
    audio: "https://example.com/dub_en.wav",
    guidance_scale: 1.5, // 1.0 to 3.0: higher favors sync (watch for distortion)
    inference_steps: 20, // 20 to 50: higher is better quality but slower
    seed: 1247,          // fixed for reproducibility; -1 randomizes each run
  },
});

console.log("generated video url:", String(output));
```

`run()` **blocks until completion**, which is convenient for the CLI and for verification. Inside a web app's request handler, however, it holds **a request open for 100+ seconds**, a breeding ground for timeouts and duplicate executions. **In production, use the webhook approach below.**

### TypeScript (Production: Async + Idempotency + Webhook)

Making this production quality takes three things: **① validate external input (type safety, security)**, **② don't double-bill for the same input (idempotency)**, and **③ don't block (async).** Implement it in a Next.js Route Handler.

```ts
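// app/api/lipsync/route.ts
import { NextResponse } from "next/server";
import { createHash } from "node:crypto";
import Replicate from "replicate";
import { z } from "zod";
// Hypothetical KV wrapper (e.g. over Vercel KV / Upstash Redis); not shown in this article.
import { jobStore } from "@/lib/job-store";

// ① Always validate external input at the boundary (the trust boundary is server-side)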
const RequestSchema = z.object({
  videoUrl: z.string().url(),
  audioUrl: z.string().url(),
  guidanceScale: z.number().min(1).max(3).default(1.5),
  inferenceSteps: z.number().int().min(20).max(50).default(20),
  seed: z.number().int().default(1247),
});
type LipSyncInput = z.infer<typeof RequestSchema>;

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

// ② Build a deterministic key from the input: same material + same parameters = same job.
function jobKey(input: LipSyncInput): string {
  return createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

export async function POST(req: Request) {
  const parsed = RequestSchema.safeParse(await req.json());
  if (!parsed.success) {
    // 422: report what is invalid (never include PII)
    return NextResponse.json({ error: parsed.error.flatten() }, { status: 422 });
  }
  const input = parsed.data;
  const key = jobKey(input);

  // Idempotency: if the job already exists, return it instead of recreating it (prevents double billing and double generation)
  const existing = await jobStore.get(key);
  if (existing) {
    return NextResponse.json({ jobId: existing.predictionId, status: existing.status, cached: true });
  }

  // ③ Start asynchronously. Completion arrives via webhook (the request returns immediately)
  const prediction = await replicate.predictions.create({
    model: "bytedance/latentsync",
    input: {
      video: input.videoUrl,
      audio: input.audioUrl,
      guidance_scale: input.guidanceScale,
      inference_steps: input.inferenceSteps,
      seed: input.seed,
    },
    webhook: `${process.env.PUBLIC_BASE_URL}/api/lipsync/webhook`,
    webhook_events_filter: ["completed"],
  });

  await jobStore.put(key, { predictionId: prediction.id, status: prediction.status });
  return NextResponse.json({ jobId: prediction.id, status: prediction.status, cached: false });
}
```

```ts
// app/api/lipsync/webhook/route.ts
import { NextResponse } from "next/server";
import { verifyReplicateSignature } from "@/lib/replicate-webhook"; // signature verification (policy described below)
// Hypothetical KV wrapper; it must also be able to look a job up by prediction ID.
import { jobStore } from "@/lib/job-store";

export async function POST(req: Request) {
  const raw = await req.text();
  // Security: always verify the webhook signature. Never trust an unverified body.
  if (!(await verifyReplicateSignature(req.headers, raw))) {
    return NextResponse.json({ error: "invalid signature" }, { status: 401 });
  }
  const event = JSON.parse(raw) as { id: string; status: string; output?: string };

  if (event.status === "succeeded" && event.output) {
    await jobStore.markDone(event.id, event.output); // store the generated URL and notify downstream stages
  } else if (event.status === "failed" || event.status === "canceled") {
    await jobStore.markFailed(event.id); // record the failure and reflect it in the UI
  }
  return NextResponse.json({ ok: true });
}
```

The key point is that the `jobStore` (a KV such as Vercel KV or Upstash Redis suffices) records each job **keyed on `sha256(input)`.** That alone provides **idempotency and cost efficiency** together: rapid clicks on the submit button run the job only once, and re-requesting the same video × audio × parameters returns the cached job. Note that the key hashes the URLs, so identical content served from a different URL counts as a new job. **Signature verification on the webhook is mandatory** (Replicate sends a signature header); an unverified endpoint is a hole through which anyone can fake a "succeeded" event.
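
As a rough illustration of the signature check, here is a minimal Python sketch. It assumes Replicate's documented webhook scheme at the time of writing (the `webhook-id`, `webhook-timestamp`, and `webhook-signature` headers, and a signing secret prefixed with `whsec_`); confirm the details against Replicate's webhook documentation. `verify_webhook` is a hypothetical helper, not part of any client library.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional

def verify_webhook(headers: Mapping[str, str], body: str, secret: str,
                   tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Check an HMAC-SHA256 webhook signature (header names assumed lower-cased)."""
    webhook_id = headers.get("webhook-id", "")
    timestamp = headers.get("webhook-timestamp", "")
    signatures = headers.get("webhook-signature", "")
    if not (webhook_id and timestamp and signatures):
        return False
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    current = time.time() if now is None else now
    if abs(current - sent_at) > tolerance_s:
        return False  # reject stale deliveries (replay protection)
    # The key is the base64-decoded part of the secret after the "whsec_" prefix.
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{webhook_id}.{timestamp}.{body}".encode("utf-8")
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()
    # The header may carry several space-separated "v1,<signature>" entries.
    candidates = [s.split(",", 1)[1] for s in signatures.split() if "," in s]
    return any(hmac.compare_digest(expected, c) for c in candidates)
```

Verify against the raw request body before parsing it as JSON; re-serializing the parsed body can change the bytes and break the comparison.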

### Python / curl

If your server is Python, the official client lets you write the equivalent.

```python
# pip install replicate
import replicate

output = replicate.run(
    "bytedance/latentsync",
    input={
        "video": "https://example.com/source.mp4",
        "audio": "https://example.com/dub_en.wav",
        "guidance_scale": 1.5,
        "inference_steps": 20,
        "seed": 1247,
    },
)
print(output)  # URL of the generated video
```

```bash
# Plain HTTP, for one-off runs from CI or a shell.
curl -s -X POST https://api.replicate.com/v1/predictions \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bytedance/latentsync",
    "input": {
      "video": "https://example.com/source.mp4",
      "audio": "https://example.com/dub_en.wav",
      "guidance_scale": 1.5,
      "inference_steps": 20,
      "seed": 1247
    }
  }'
```

---

## Usage B: Self-Hosting (Faithful to the Official Procedure)

If you need large batches, data privacy, or the lowest per-unit cost, **self-host on your own GPU.** The right approach here is to **follow the official `README` / `setup_env.sh` / `requirements.txt` exactly**; arbitrary version changes are a common source of dependency breakage.

### 1. Environment Setup (the contents of the official `setup_env.sh`)

The official premise is a **conda environment + Python 3.10.13.**

```bash
# System dependency (Ubuntu family)
sudo apt -y install libgl1

# Create the conda environment (Python pinned to 3.10.13)
conda create -y -n latentsync python=3.10.13
conda activate latentsync

# Install ffmpeg from conda-forge
conda install -y -c conda-forge ffmpeg

# Install the Python dependencies in one go
pip install -r requirements.txt
```

The main pins in `requirements.txt` (**don't bump them arbitrarily**):

```text
torch==2.5.1            # CUDA 12.1 build (--extra-index-url cu121)
torchvision==0.20.1
diffusers==0.32.2
transformers==4.48.0
mediapipe==0.10.11      # face landmarks
insightface==0.7.3      # face detection and alignment
onnxruntime-gpu==1.21.0
DeepCache==0.1.1        # implementation behind enable_deepcache
gradio==5.24.0
numpy==1.26.4
```

> 🔧 **A gotcha**: `torch==2.5.1` pulls the **CUDA 12.1** build via `--extra-index-url https://download.pytorch.org/whl/cu121`. If the driver and CUDA versions don't match, `onnxruntime-gpu` falls back to CPU and things "**work but are painfully slow.**" Always check the driver with `nvidia-smi` and confirm that `python -c "import torch; print(torch.cuda.is_available())"` prints `True`.

### 2. Fetch the Checkpoints

`setup_env.sh` fetches the weights from HuggingFace. To do it by hand, run these two commands.

```bash
huggingface-cli download ByteDance/LatentSync-1.6 whisper/tiny.pt --local-dir checkpoints
huggingface-cli download ByteDance/LatentSync-1.6 latentsync_unet.pt --local-dir checkpoints
```

The resulting directory structure looks like this.

```text
./checkpoints/
├── latentsync_unet.pt      # main model (U-Net)
└── whisper/
    └── tiny.pt             # Whisper for audio embeddings
```

### 3. Inference (the contents of the official `inference.sh`)

To try it with a GUI, run `python gradio_app.py`. For batch or automated use, use the CLI (`./inference.sh`). The official `inference.sh` **defaults to 1.6 (512×512)** and contains the following.

```bash
#!/bin/bash
python -m scripts.inference \
    --unet_config_path "configs/unet/stage2_512.yaml" \
    --inference_ckpt_path "checkpoints/latentsync_unet.pt" \
    --inference_steps 20 \
    --guidance_scale 1.5 \
    --enable_deepcache \
    --video_path "assets/demo1_video.mp4" \
    --audio_path "assets/demo1_audio.wav" \
    --video_out_path "video_out.mp4"
```

The **full arguments** `scripts.inference` takes (per the argparse definition):

| Argument | Type | Default | Role |
| --- | --- | --- | --- |
| `--unet_config_path` | str | `configs/unet.yaml` | U-Net config. 1.6 is `configs/unet/stage2_512.yaml` |
| `--inference_ckpt_path` | str | **required** | U-Net checkpoint |
| `--video_path` | str | **required** | Input video |
| `--audio_path` | str | **required** | The audio to apply |
| `--video_out_path` | str | **required** | The output destination |
| `--inference_steps` | int | `20` | Diffusion iteration count. 20→50 for higher quality, slower |
| `--guidance_scale` | float | `1.0` | The audio condition's strength. 1.0→3.0 for sync-focus |
| `--seed` | int | `1247` | Random seed. `-1` for random each time |
| `--temp_dir` | str | `temp` | Intermediate-file location |
| `--enable_deepcache` | flag | off | Speed up inference with DeepCache |

> 💡 `inference.sh` passes `--guidance_scale 1.5`, while the argparse default is `1.0`. **In real operation I recommend around 1.5**, on the grounds that the official demo is configured that way. When you call `python -m scripts.inference` directly, sync can feel weak at `1.0`, so pass the value explicitly.

Internally, inference runs in **`float16` when the GPU's compute capability is 8.0 or higher** (Ampere/Ada generation, e.g. A100, L40S, RTX 30/40) and in `float32` otherwise (V100, T4, etc.). If an older GPU is "slower than expected" or "uses more VRAM than expected," this branch is the reason.
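
The precision branch can be pictured with the sketch below. It restates the rule from the paragraph above rather than quoting the repository's code; `pick_dtype_name` is a hypothetical helper that takes the `(major, minor)` tuple you would get from `torch.cuda.get_device_capability()`.

```python
def pick_dtype_name(compute_capability: tuple[int, int]) -> str:
    """float16 for compute capability 8.0 and above, float32 below."""
    major, _minor = compute_capability
    return "float16" if major >= 8 else "float32"

# Published compute capabilities of a few common GPUs.
GPUS = {"A100": (8, 0), "L40S": (8, 9), "T4": (7, 5), "V100": (7, 0)}

for name, capability in GPUS.items():
    print(name, capability, pick_dtype_name(capability))
```

Running a check like this before scheduling a job makes it easy to predict which machines will take the slower, more memory-hungry `float32` path.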

---

## Practical Parameter Tuning: Per-Use Presets

The official docs give only the ranges (`inference_steps` [20-50] / `guidance_scale` [1.0-3.0]). Here I add **values that have worked in practice.** The table below is a set of ready-to-use presets.

| Use | `inference_steps` | `guidance_scale` | `enable_deepcache` | Version | Aim |
| --- | --- | --- | --- | --- | --- |
| **Draft check** (internal review) | 20 | 1.5 | ✅ | 1.5 | See the shape fast and cheap |
| **Standard dub** (YouTube/PR) | 25–30 | 1.5–2.0 | ✅ | 1.6 | The best point of quality and speed |
| **Close-up hero cut** (face large) | 35–50 | 1.5 | ❌ | 1.6 | Teeth/lip detail top priority |
| **Material with weak sync** (fast talk/song) | 30 | 2.0–2.5 | ✅ | 1.6 | Strengthen the mouth's following |
| **Distortion/jitter appeared** | +10 steps | **lower it** (→1.5) | ✅ | 1.6 | Revert the over-raised guidance |

Restating what each knob means in practical terms:

- **`inference_steps` (diffusion iterations)**: higher is cleaner but **linearly slower.** Going from 20 to 40 improves quality but doubles the time. The most cost-effective flow is to **check everything at 20 first, then rerun only the adopted cuts at high steps.**
- **`guidance_scale` (audio-condition strength)**: raising it makes **the mouth follow the audio more closely**, but raising it too far makes **the face distort or jitter.** The official docs also state plainly that "**accuracy rises but with possible distortion/jitter.**" **Start from 1.5 and, if sync is insufficient, raise it in steps of 0.5.** Revert if distortion appears.
- **`seed`**: **fix it and the result is reproducible.** If the picture changes on every A/B comparison or redo, verification becomes impossible, so **always fix it in production** (the default `1247` is fine). Change it only when you want a new variation.
- **`enable_deepcache`**: a **speed-up flag** backed by DeepCache. In practice, turn it ON for drafts and OFF for final close-ups to avoid a slight quality difference.
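
One way to keep presets out of ad-hoc shell history is to encode them as data and build the CLI arguments from them. The sketch below is illustrative: `PRESETS` picks one concrete point from each range in the preset table, and `build_inference_args` is a hypothetical helper. The flags themselves are the ones listed in the argument table earlier.

```python
# Illustrative presets: one concrete point chosen from each range in the table.
PRESETS = {
    "draft":    {"steps": 20, "guidance": 1.5, "deepcache": True},
    "standard": {"steps": 25, "guidance": 1.5, "deepcache": True},
    "closeup":  {"steps": 40, "guidance": 1.5, "deepcache": False},
}

def build_inference_args(preset: str, video: str, audio: str, out: str,
                         config: str = "configs/unet/stage2_512.yaml",
                         ckpt: str = "checkpoints/latentsync_unet.pt",
                         seed: int = 1247) -> list[str]:
    """Build the argv list for `python -m scripts.inference` from a named preset."""
    p = PRESETS[preset]
    args = [
        "python", "-m", "scripts.inference",
        "--unet_config_path", config,
        "--inference_ckpt_path", ckpt,
        "--inference_steps", str(p["steps"]),
        "--guidance_scale", str(p["guidance"]),
        "--seed", str(seed),
        "--video_path", video,
        "--audio_path", audio,
        "--video_out_path", out,
    ]
    if p["deepcache"]:
        args.append("--enable_deepcache")  # flag only; omitted for final close-ups
    return args

print(" ".join(build_inference_args("draft", "in.mp4", "dub.wav", "out.mp4")))
```

The returned list can be passed to `subprocess.run(args, check=True)` from the repository root, which avoids shell quoting problems with file names.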

> **A principle tied directly to cost**: quality costs **time, and therefore money, roughly in proportion to `inference_steps`.** Running every cut at high steps is wasteful. The two-tier flow of "**review drafts at 20 → regenerate only the approved cuts at high steps**" does the most to lower your production unit cost.
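
The arithmetic behind the two-tier flow is worth checking before adopting it. The sketch below assumes cost is exactly proportional to `inference_steps` (the article only says "roughly") and that all cuts have equal length; `two_tier_cost_ratio` is a hypothetical helper.

```python
def two_tier_cost_ratio(n_cuts: int, n_adopted: int,
                        draft_steps: int = 20, final_steps: int = 40) -> float:
    """Cost of (draft everything + rerun adopted cuts) relative to running everything at final_steps."""
    all_high = n_cuts * final_steps
    two_tier = n_cuts * draft_steps + n_adopted * final_steps
    return two_tier / all_high

# 10 cuts drafted at 20 steps, then 3 or 5 of them regenerated at 40 steps.
print(two_tier_cost_ratio(10, 3))
print(two_tier_cost_ratio(10, 5))
```

Under these assumptions the two-tier flow is cheaper only when the adopted fraction is below `1 - draft_steps / final_steps` (half of the cuts for 20 versus 40 steps). If nearly every cut ends up in the final video, drafting first adds cost, and its value lies in catching bad cuts early rather than in saving GPU time.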

---

## 5 Pitfalls That Always Bite in Production, and Resilience Design

These problems don't appear in a demo (10 seconds, frontal, no silence) but **erupt all at once with real material (tens of minutes, profile views, silence, multiple speakers).** I present causes and countermeasures in the order I hit them in real operation. As the mechanism chapter showed, they are **a necessary consequence of LatentSync's design.**

### ① Face-Detection / Face-Alignment Failure (Most Frequent)

LatentSync detects and aligns the face internally (mediapipe / insightface) before generating the mouth. With **profile views, downcast faces, hand or mic occlusion, or extreme face sizes**, detection can miss, and then the mouth collapses or processing crashes.

**Countermeasures**:

- **Check for the presence and frontality of a face yourself** before submitting, and for cuts where detection fails, **skip lip-sync and pass the source through unchanged.** Deciding not to force sync on every cut protects quality.
- If **multiple faces** appear in one frame, identify the speaker and **crop first**, then submit.

```python
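# Pre-submission guard: send only cuts with exactly one confidently detected face to LatentSync.
# Note: this checks face count only; a frontality check would need landmarks on top of this.
import mediapipe as mp

_detector = mp.solutions.face_detection.FaceDetection(model_selection=1, min_detection_confidence=0.6)

def is_lipsyncable(frame_rgb) -> bool:
    """True only when exactly one face is detected with sufficient confidence; on False, pass the source through."""
    result = _detector.process(frame_rgb)
    faces = result.detections or []
    return len(faces) == 1  # exclude multi-face and no-face cuts from sync to avoid breakage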
```

### ② VRAM Exhaustion (OOM) on Long Video

A diffusion model's **VRAM use grows with the number of frames processed at once.** Pass a video of tens of minutes in one go and even 18GB fails with **CUDA out of memory.**

**Countermeasure**: **split the video into segments, process them sequentially, and concatenate the results.** In addition, since silent intervals need no mouth movement, **using voice-activity detection (VAD) to keep silence away from LatentSync** improves both quality (no hallucinated mouth flapping during silence) and cost. On my platform, this VAD pass-through **reduced lip-sync GPU-processing cost by about 40%** (details in the [pipeline-design article](/blog/production-ai-video-localization-lipsync-gpu-pipeline)).

```python
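# Treat OOM as a normal case: on failure, narrow the window and retry (resilience).
# `run_latentsync`, `concat`, and the `seg` interface are this pipeline's own helpers (not shown).
import torch

def lipsync_segment(seg, *, max_frames: int = 750):
    # Split oversized segments up front instead of waiting for an OOM.
    if seg.frame_count > max_frames:
        a, b = seg.split_in_half()
        return concat(lipsync_segment(a, max_frames=max_frames),
                      lipsync_segment(b, max_frames=max_frames))
    try:
        return run_latentsync(seg)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        if seg.frame_count <= 1:
            raise  # cannot split further: a genuine anomaly, so do not swallow it
        a, b = seg.split_in_half()  # halve the segment to keep the fallback local
        return concat(lipsync_segment(a, max_frames=max_frames),
                      lipsync_segment(b, max_frames=max_frames))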
```
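
The silence pass-through can be sketched with a crude energy threshold. This is a stand-in for a real VAD, not the detector used on my platform; `voiced_spans`, the 30 ms frame size, and the RMS threshold are all illustrative assumptions, and samples are assumed to be floats in [-1, 1].

```python
import math

def voiced_spans(samples: list[float], sample_rate: int,
                 frame_ms: int = 30, rms_threshold: float = 0.02) -> list[tuple[float, float]]:
    """Return (start_s, end_s) spans whose per-frame RMS energy reaches the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len  # a trailing partial frame is ignored
    spans: list[tuple[float, float]] = []
    start = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        t = i * frame_len / sample_rate
        if rms >= rms_threshold and start is None:
            start = t                      # speech begins
        elif rms < rms_threshold and start is not None:
            spans.append((start, t))       # speech ends
            start = None
    if start is not None:
        spans.append((start, n_frames * frame_len / sample_rate))
    return spans

# 0.2 s silence, 0.3 s of signal, 0.1 s silence at a toy 1 kHz sample rate.
audio = [0.0] * 200 + [0.5] * 300 + [0.0] * 100
print(voiced_spans(audio, sample_rate=1000, frame_ms=100))
```

Only the returned spans would be sent to LatentSync; everything outside them is copied from the source video. In production a trained VAD is more robust to noise and music than a fixed threshold.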

### ③ fps / Resolution / Audio-Format Mismatch

If the input **fps is inconsistent** or the audio uses an unexpected codec, the mouth and audio **drift apart across the whole clip** or processing fails.

**Countermeasure**: add a stage that **normalizes with ffmpeg** before submission. Simply fixing the fps and converting the audio to 16kHz WAV noticeably improves reproducibility and stability.

```bash
# Pre-submission normalization: fix the fps and unify audio as 16kHz mono WAV
ffmpeg -y -i input.mp4 -r 25 -c:v libx264 -pix_fmt yuv420p norm_video.mp4
ffmpeg -y -i dub.m4a -ar 16000 -ac 1 dub.wav
```
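
To avoid re-encoding files that are already normalized, you can inspect the frame rate first. The sketch below parses the rational string that `ffprobe` reports for `r_frame_rate`; the helper names and the tolerance are assumptions, and the 25 fps target simply mirrors the ffmpeg command above.

```python
from fractions import Fraction

# The rate string can be obtained with, for example:
#   ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate \
#           -of default=noprint_wrappers=1:nokey=1 input.mp4

def parse_frame_rate(r_frame_rate: str) -> float:
    """Convert ffprobe's rational frame rate (e.g. '30000/1001') to a float."""
    return float(Fraction(r_frame_rate))

def needs_fps_normalization(r_frame_rate: str, target_fps: float = 25.0,
                            tol: float = 0.01) -> bool:
    """True when the source frame rate differs from the target by more than tol."""
    return abs(parse_frame_rate(r_frame_rate) - target_fps) > tol

for rate in ("25/1", "30000/1001", "60/1"):
    print(rate, needs_fps_normalization(rate))
```

Variable-frame-rate sources can report a nominal rate that looks fine while the actual timestamps drift, so when in doubt, normalize anyway.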

### ④ Teeth/Mouth Blur

**Teeth blur in 1.5** was a weakness the official docs acknowledged as well, and **1.6 addresses it with 512×512 training.**

**Countermeasure**: for material where the face fills the frame, **reach for 1.6 first.** If blur still bothers you, raise `inference_steps`. Sticking with 1.5 for close-up-heavy content means shouldering a problem that is already solved.

### ⑤ Checking Quality Only by "Eye"

The most dangerous thing in production is **a quality gate that consists only of human eyeballing.** In large batches, broken cuts will slip through.

**Countermeasure**: **score sync quality mechanically** with the evaluation scripts bundled in the LatentSync repository, and send only the cuts below a threshold to human review or regeneration. Using SyncNet confidence as a **quantitative gate** drastically cuts review effort.

```bash
# Machine-score the audio-mouth sync of generated output (wire into CI or post-processing)
./eval/eval_sync_conf.sh     # SyncNet confidence
./eval/eval_syncnet_acc.sh   # SyncNet accuracy
```
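
Once each cut has a score, the gate itself is a few lines. The sketch below assumes you have already collected one SyncNet confidence value per cut (how you parse the eval script's output depends on your setup); `route_by_confidence` and the threshold of 5.0 are hypothetical and should be calibrated on your own material.

```python
def route_by_confidence(scores: dict[str, float],
                        threshold: float = 5.0) -> tuple[list[str], list[str]]:
    """Split cut IDs into (passed, needs_review) by SyncNet confidence."""
    passed = [cut for cut, score in scores.items() if score >= threshold]
    needs_review = [cut for cut, score in scores.items() if score < threshold]
    return passed, needs_review

# Hypothetical per-cut scores.
scores = {"cut_001": 7.8, "cut_002": 3.1, "cut_003": 6.2}
passed, needs_review = route_by_confidence(scores)
print("auto-approved:", passed)
print("human review or regeneration:", needs_review)
```

Log the score alongside the parameters used for each cut so that the threshold can be tuned later from real review outcomes.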

---

## Production-Operation Design Principles (Observability, Idempotency, Resilience, Cost)

Here are the pitfall countermeasures restated in terms of operational quality. This is the difference between "works" and "doesn't fall over in production."

- **Idempotency**: use `sha256(video + audio + steps + guidance + seed)` as the **job key** and cache the result. Never generate twice on resends, rapid clicks, or retries. The design is the same for Replicate and for a self-hosted GPU.
- **Resilience**: treat OOM, face-detection failure, and GPU preemption **as normal cases, not exceptions.** With "shrink the window and retry" and "pass the source through for cuts that can't be synced," one cut's failure never drags down the whole job.
- **Observability**: record in structured logs, per cut, **the version, step count, guidance value, and SyncNet confidence.** That way you can later trace why only a particular cut's mouth collapsed. Never write PII (faces, audio content) to logs.
- **Cost efficiency**: cut unit cost in three tiers: ① pass silence through with VAD, ② draft at 20 steps and rerun only adopted cuts at high steps, ③ eliminate regeneration with the idempotency cache. On Replicate, $0.11 per run **adds up directly**, so every wasted run you avoid saves money.
- **Self-hosting unit cost**: keeping an 18GB-class spot or on-demand GPU **fully loaded** with batches tends to make the per-unit cost lower than Replicate. Once you add environment setup, operations, and GPU procurement, though, the rule of thumb for break-even is **API for small volume, self-hosting for steady large volume.**
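
For a self-hosted pipeline, the job key can be computed from file contents rather than URLs. The sketch below is one possible shape: `file_sha256` and `job_key` are hypothetical helpers, and including the model version in the key is my own assumption so that a 1.5 result is never served for a 1.6 request.

```python
import hashlib
import json

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def job_key(video_digest: str, audio_digest: str, steps: int,
            guidance: float, seed: int, version: str = "1.6") -> str:
    """Deterministic key: same content + same parameters always yields the same key."""
    payload = {
        "video": video_digest,
        "audio": audio_digest,
        "steps": steps,
        "guidance": guidance,
        "seed": seed,
        "version": version,
    }
    # sort_keys and fixed separators make the serialization canonical.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

print(job_key("aa" * 32, "bb" * 32, steps=20, guidance=1.5, seed=1247))
```

Hashing content instead of URLs means a re-uploaded copy of the same file still hits the cache, at the price of reading the file once.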

---

## Comparison with Other Lip-Sync Models

This section answers "so which one should I use?" The right answer is to **choose by use case**; no model is all-purpose.

| Model | Method | Strengths | Weaknesses | License | Suited scene |
| --- | --- | --- | --- | --- | --- |
| **LatentSync** | Diffusion (audio-conditioned LDM) + SyncNet/TREPA | **Strong overall: naturalness, temporal consistency, sync accuracy** | Needs substantial VRAM, weak on profile views | **Apache-2.0** | High-quality dubbing, avatars, learning materials |
| **Wav2Lip** | GAN-family (lightweight) | Light, fast, runs on low-spec hardware | Low resolution, limited mouth detail | Research-use-centric (confirm) | Prototypes, low-load bulk processing |
| **SadTalker** | 3DMM + face generation | **Can talk from a single still image** | Tends to look less natural than video-based methods | Confirm | Talking head from a photo |
| **MuseTalk** | Real-time-oriented | **Fast, near-real-time** | Sync stability can fall a step behind diffusion-family models | Confirm | Live-leaning, low-latency requirements |

In practice, the decision goes roughly as follows.

- **Quality comes first, and you want commercial use cleared under Apache-2.0** → **LatentSync.**
- **You only have a single still image** → SadTalker.
- **You need low latency, near real time** → MuseTalk.
- **You need light, bulk processing** → a **two-tier** flow also works: draft with Wav2Lip, then regenerate the adopted cuts with LatentSync.

> Model licenses change over time, so **always confirm commercial-use terms with the primary source.** At the time of writing, LatentSync is **Apache-2.0** (official repository).

---

## Frequently Asked Questions (FAQ)

**Q. Can I use it commercially?**
A. LatentSync is licensed under **Apache-2.0** in the official repository, which permits commercial use, but **the portrait rights of the people in the output and the rights to the audio are a separate matter.** Making a real person speak without their consent carries major legal and ethical risk, so use **material you have consent for.**

**Q. Can I use it with Japanese audio?**
A. Yes. The model is conditioned on audio through Whisper embeddings, so it **generates mouth movement independently of the language.** Given that **1.5 improved performance on Chinese videos**, it is practical for non-English audio too.

**Q. What GPU / VRAM is needed?**
A. For inference, **1.6 needs about 18GB and 1.5 about 8GB.** 1.6 calls for an L40S / A100 / RTX 4090 (24GB) class GPU, while 1.5 runs even on 8–16GB class cards. Training is a different matter: 512×512 Stage 2 needs **55GB** (training is normally unnecessary; inference alone suffices).
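
As a rough illustration of that answer, a pre-flight check can pick the checkpoint from the available VRAM. The function name `pick_version` and the optional `headroom_gb` margin are my own assumptions; the 18GB / 8GB thresholds are simply the approximate figures quoted above, not values read from the model itself.

```python
from typing import Optional

# Approximate inference VRAM figures quoted in this article (GB).
VRAM_REQUIRED_GB = {"1.6": 18.0, "1.5": 8.0}


def pick_version(free_vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the newest LatentSync version that fits, or None if neither fits.

    headroom_gb is an assumed safety margin, not an official figure.
    """
    for version in ("1.6", "1.5"):  # prefer the newer checkpoint
        if free_vram_gb >= VRAM_REQUIRED_GB[version] + headroom_gb:
            return version
    return None
```

On a machine with PyTorch installed, the free VRAM can be read with `torch.cuda.mem_get_info()`, which returns `(free, total)` in bytes; divide by `1024**3` before passing the value in.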

**Q. Can I use it for live streaming?**
A. It is not suited to live use. It is diffusion-based and takes **seconds to minutes** per run (about 108 seconds per run on Replicate). **For live streaming, consider a low-latency model such as MuseTalk.**

**Q. Can I match the mouth even on profile views?**
A. This is a weak spot. The model assumes internal face alignment, so **the closer the face is to frontal, the more stable the result.** For cuts where a profile view persists, **skipping sync and passing the source through unchanged** is the realistic way to protect quality.

**Q. What's the minimum/maximum video length?**
A. There is no explicit upper bound, but **long videos require segment splitting because of VRAM constraints.** In practice, design the pipeline around "segment splitting + silence skipping + idempotency cache."
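
A minimal sketch of the segment-splitting and idempotency-cache parts of that answer follows. The helper names `plan_segments` and `cache_key`, the 30-second default window, and the key layout are all assumptions made for illustration; silence skipping is left out, and the right segment length depends on your GPU and resolution.

```python
import hashlib
from typing import List, Tuple


def plan_segments(duration_s: float, max_segment_s: float = 30.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_segment_s."""
    if duration_s <= 0 or max_segment_s <= 0:
        raise ValueError("duration_s and max_segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def cache_key(video_sha256: str, audio_sha256: str, start: float, end: float, params: str) -> str:
    """Deterministic key: identical inputs and settings always map to the same segment."""
    raw = f"{video_sha256}|{audio_sha256}|{start:.3f}|{end:.3f}|{params}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Before rendering a segment, look up its key in whatever store you use (a directory, object storage, a database); a hit means the segment was already produced with the same inputs and settings, so a retry after an OOM or crash only redoes the missing pieces.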

**Q. Which is cheaper, own GPU or the API?**
A. **The API is cheaper at small volume** (zero environment setup, about $0.11 per run). For **steady large volume**, self-hosting on a spot GPU that you keep full with batches tends to win on per-unit cost. The standard path is to verify quality and demand with the API first, then move to self-hosting once volume grows.
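
The trade-off can be made concrete with a back-of-the-envelope calculation. In this sketch the GPU hourly price is a hypothetical input you supply, and I assume a self-hosted run takes about as long as the roughly 108 seconds quoted for Replicate. Engineering and operations costs are ignored, so treat the result as a lower bound on the self-hosting side.

```python
API_PRICE_PER_RUN = 0.11   # USD, Replicate figure quoted in this article
SECONDS_PER_RUN = 108.0    # assumption: self-hosted runs take about as long as on Replicate


def self_host_cost_per_run(gpu_usd_per_hour: float, utilization: float) -> float:
    """Per-run cost when only `utilization` (0-1] of the paid GPU time does useful work."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_usd_per_hour * SECONDS_PER_RUN / 3600.0 / utilization


def break_even_utilization(gpu_usd_per_hour: float) -> float:
    """Utilization above which self-hosting beats the API price per run."""
    return gpu_usd_per_hour * SECONDS_PER_RUN / 3600.0 / API_PRICE_PER_RUN
```

With a hypothetical GPU price of $1.00 per hour, a run costs $0.03 at full utilization, and the break-even utilization is about 27%. Below that the idle time eats the savings, which is why batch-filling the GPU matters.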

---

## Summary: Moving LatentSync from "Works" to "Earns in Production"

The essence of LatentSync lies in **achieving both "diffusion's naturalness" and "accurate audio-mouth sync" through SyncNet supervision and TREPA.** That is why its quality is high, and it is also why you need to understand the **knobs that follow from its design**: face alignment, step count (`inference_steps`), and conditioning strength (`guidance_scale`).

The path to implementation is simple.

1. First check the quality on your own material with the **Replicate API** (no GPU of your own needed, about $0.11 per run).
2. If the quality convinces you, **self-host 1.6** and refine `inference_steps` / `guidance_scale` starting from the per-use presets.
3. **For production, build idempotency, resilience, observability, and cost control into the design**: a face-detection guard, segment splitting, VAD-based silence skipping, and the SyncNet-confidence gate.

Only at this point does it stop being a demo and become a product that "**doesn't fall over on the customer's material.**" The point I most want to convey is that **this chain of design decisions is where the choice of contractor makes a difference.** Anyone can build a demo that just wires up the model, but for a foundation that survives real material (long videos, profile views, silence, multiple speakers), **the number of pitfalls the builder has already hit translates directly into quality.**

> I implemented the pitfalls and resilience design described in this article in an **AI video-localization platform that is actually running in production.** If you are considering building or improving a video AI pipeline that includes lip-sync, see my [portfolio](/case-studies/ai-video-localization-lipsync) and feel free to get in touch. Working as **one person × generative AI**, I take projects from PoC through production operation quickly, affordably, and safely.

---

## Sources / Official Resources

- **Official repository (GitHub)**: [bytedance/LatentSync](https://github.com/bytedance/LatentSync) — README, `inference.sh`, `setup_env.sh`, `requirements.txt`, training/evaluation scripts
- **Paper (arXiv)**: [LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync (2412.09262)](https://arxiv.org/abs/2412.09262) — SyncNet supervision, TREPA, StableSyncNet, quantitative results
- **Model distribution (HuggingFace)**: [ByteDance/LatentSync-1.6](https://huggingface.co/ByteDance/LatentSync-1.6) — 1.5/1.6 checkpoints
- **Hosting API (Replicate)**: [bytedance/latentsync](https://replicate.com/bytedance/latentsync) — input schema, pricing, execution environment

* Versions, parameters, pricing, and licenses get updated. **Always confirm the primary source before implementing.** This article's numbers (SyncNet 91%→94%, inference VRAM 8GB/18GB, 512×512, about $0.11 and about 108 seconds per run on an L40S, etc.) are based on the official information at the time of writing.