Introduction: "just connecting models" won't get video localization into production
"Transcribe with Whisper, translate with an LLM, dub with TTS, and match the mouth with a lip-sync model" — this single sentence is correct as a recipe for demoing the full automation of video localization in 5 minutes. But something implemented straight from it will definitely break in front of the customer.
Why? Because video localization is an inherently unstable pipeline: five stages of heavy, expensive, probabilistically failing GPU processing connected in series. Problems that never surface in a demo video (10 seconds, frontal face, no silence) erupt all at once on real material (30 minutes, multiple speakers, long silences, cut transitions, faces in profile).
This article takes as its subject a real product, an AI video-localization / lip-sync platform that I designed and implemented single-handedly, and lays out at the implementation level the design decisions that raised this unstable pipeline to a quality that withstands production operation. The themes come down to the following three points.
- Don't crash (reliability): run a GPU job lasting tens of minutes to completion, even across the forced stop of a spot instance.
- Cheaply (cost): trim GPU inference, the most expensive resource, without lowering quality.
- Naturally (quality): absorb the duration gap introduced by machine translation, suppress the diffusion model's hallucinations, and keep mouth and audio in sync without breakdown.
An overview of the related track record is summarized under "AI video-localization / lip-sync platform." This article is the technical edition that digs into why I built it that way.
The big picture: the five-stage pipeline and the "swappable" skeleton that supports it
First, let's pin down the processing flow. One video passes through the following five stages in order.
upload
└─ ① audio separation (separate vocals / BGM)
└─ ② transcription (STT, automatic language detection, timestamped subtitles)
└─ ③ translation (per subtitle, giving the speech duration as a constraint)
└─ ④ speech synthesis (voice cloning, isochrony fit)
└─ ⑤ lip-sync (mouth synchronization, subtitle burn-in)
└─ completed video
Each time a stage succeeds, the status transitions to EDITING (user-reviewable), which inserts a human-in-the-loop step where subtitles and translations can be corrected manually. If any stage fails, the status becomes FAILED, and error_stage / error_message record which stage failed and why. This record is the foundation for the resumption and debugging described later.
Design principle: the AI engine is "implementation detail," not "architecture"
The most important decision in this domain was to drive the question of "which model to use" out of the structure.
State-of-the-art lip-sync models are replaced on a cycle of months. In this product too, the lead engine moved from the Wav2Lip family to MuseTalk to LatentSync, and translation switched from NLLB to Qwen3 served on vLLM. If the code were tightly coupled to a specific model, the architecture would collapse at every model update.
So I built every stage from the same three-piece set: interface → provider → factory. Each stage defines its contract as an abstract base class (ABC), the concrete implementation (provider) is selected by an environment variable, and instantiation is consolidated in a factory.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class TranscriptionResult:
    segments: list["Segment"]  # timestamped (start, end, text) entries
    detected_language: str
    language_probability: float


class Transcriber(ABC):
    """Contract for transcription. Implementations hide behind it."""

    @abstractmethod
    async def transcribe(
        self, audio_path: str, lang: str | None = None
    ) -> TranscriptionResult: ...
The factory has two key points: selection driven by an environment variable, and lazy import.
import importlib

from app.core.config import settings
from app.core.enums import STTProvider

# enum value -> (module path, class name). Heavy ML libraries are not imported here.
_STT_PROVIDERS: dict[STTProvider, tuple[str, str]] = {
    STTProvider.FASTER_WHISPER: ("app.services.processors.transcriber.faster_whisper", "FasterWhisperTranscriber"),
    STTProvider.REAZON_SPEECH: ("app.services.processors.transcriber.reazon_speech", "ReazonSpeechTranscriber"),
    STTProvider.OPENAI_WHISPER: ("app.services.processors.transcriber.openai_whisper", "OpenAIWhisperTranscriber"),
}


def create_transcriber() -> Transcriber:
    module_path, class_name = _STT_PROVIDERS[settings.stt_provider]
    module = importlib.import_module(module_path)  # lazily load only the selected implementation
    return getattr(module, class_name)()
Why is lazy import vital? Libraries such as faster-whisper, vLLM, and the diffusion-model family can initialize a CUDA context merely by being imported at the top level, reserving several hundred MB to several GB of VRAM. If those imports live outside the factory, a worker process that only uses TTS also pays for the lip-sync library's VRAM and is dragged toward exhaustion. The position of an import statement directly becomes a cause of production OOM, a pitfall specific to GPU applications that is easy to overlook.
With this skeleton, six layers (audio separation, transcription, translation, speech synthesis, lip-sync, and storage) became swappable by environment variable alone, with zero code changes.
| Layer | Interface | Main implementations | Selection key |
|---|---|---|---|
| Audio separation | AudioSeparator | FFmpeg / Demucs / UVR5(MDX-Net) | SEPARATOR_PROVIDER |
| Transcription | Transcriber | faster-whisper(large-v3) / ReazonSpeech / OpenAI Whisper | STT_PROVIDER |
| Translation | Translator | vLLM(Qwen3-8B-AWQ) / 4-bit quantized Llama-3 | TRANSLATOR_PROVIDER |
| Speech synthesis | Synthesizer | voice-cloning-capable TTS (HTTP service) | TTS_PROVIDER |
| Lip-sync | LipSyncer | MuseTalk / LatentSync / VideoReTalking | LIPSYNC_PROVIDER |
| Storage | StorageBackend | local FS / Azure Blob | STORAGE_TYPE |
This is the Open-Closed Principle applied in practice. Adding a new engine means writing one new provider and adding one line to the registry. Existing code is not touched at all.
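To make the "one provider plus one registry line" claim concrete, here is a small self-contained sketch. Everything in it is hypothetical: `MY_NEW_ENGINE`, `MyNewEngineTranscriber`, and the `demo_new_engine` module are invented for illustration, and the module is faked in `sys.modules` only so the example runs without extra files.

```python
import importlib
import sys
import types
from enum import Enum


class STTProvider(str, Enum):
    MY_NEW_ENGINE = "my_new_engine"  # hypothetical new provider


class MyNewEngineTranscriber:
    """The one new provider class (a real one would wrap an ML library)."""

    def transcribe_sync(self, audio_path: str) -> str:
        return f"transcript of {audio_path}"


# Stand-in for a provider module file, registered so import_module can find it.
_demo = types.ModuleType("demo_new_engine")
_demo.MyNewEngineTranscriber = MyNewEngineTranscriber
sys.modules["demo_new_engine"] = _demo

_STT_PROVIDERS: dict[STTProvider, tuple[str, str]] = {
    # Existing entries omitted. This is the one added registry line:
    STTProvider.MY_NEW_ENGINE: ("demo_new_engine", "MyNewEngineTranscriber"),
}


def create_transcriber(provider: STTProvider):
    module_path, class_name = _STT_PROVIDERS[provider]
    module = importlib.import_module(module_path)  # resolved only when selected
    return getattr(module, class_name)()


# The enum value would normally come from an environment variable.
engine = create_transcriber(STTProvider("my_new_engine"))
```

The factory function itself never changes when a provider is added, which is what keeps the existing code closed to modification.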
The first wall: keeping long GPU jobs from crashing
Why you must not process inside an HTTP request
Lip-syncing a 30-minute video can take over an hour on a single T4. Executing this synchronously inside the API's request-response cycle is out of the question: timeouts, double execution caused by retries, invisible progress. Everything breaks down.
The solution is clear: move all heavy processing out to an asynchronous Celery worker. But there is one constraint specific to GPU applications. The worker runs with --pool=threads --concurrency=1.
celery -A app.worker.celery_app worker --pool=threads --concurrency=1 -Q gpu -n gpu@%h
Concurrency is fixed at 1 in order to serialize access to the GPU. If two diffusion-model inferences run simultaneously on one T4, VRAM is exhausted instantly. The default prefork pool, which forks processes, is a poor match for the CUDA context and for vLLM's daemon process, and it freezes silently. That "parallelize to speed it up" turns into "parallelize to reliably break it" is the counterintuitive part of GPU workloads. Gain throughput through optimizations inside each segment and keep tasks strictly serial: accepting this limit is what produces stability.
Withstanding spot interruption: segment splitting + persistent cache + resumption
For cost reasons, the production GPU runs on a spot VM (T4 / Standard_NC8as_T4_v3). Spot is cheap, but in return the cloud provider can evict it at its own convenience with almost no warning. If an hour of lip-sync is killed at the 59-minute mark, a naive implementation starts over from the beginning. That is not viable as a business.
So I split long videos into segments and cache each segment's output on a persistent disk. The disk has prevent_destroy set in Terraform, so it survives even if the VM is recreated.
The design of the cache key decides whether resumption is correct. The same input and parameters must always map to the same cached output. To guarantee this idempotency, the key is derived from the following elements.
import hashlib


def segment_cache_key(
    project_id: str,
    audio_id: str,
    segment_seconds: float,
    engine: str,  # "musetalk" / "latentsync" ...
    tuning: str,  # signature folding in inference_steps, guidance_scale, etc.
    sync_spans: tuple[tuple[float, float], ...],  # the finalized plan of speech intervals
) -> str:
    raw = f"{project_id}|{audio_id}|{segment_seconds}|{engine}|{tuning}|{sync_spans}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
The resumption logic reduces to one simple rule: "if it's in the cache, don't compute."
async def lipsync_segment(seg: Segment, ctx: LipSyncContext) -> Path:
    key = segment_cache_key(ctx.project_id, ctx.audio_id, ctx.segment_seconds,
                            ctx.engine, ctx.tuning, ctx.sync_spans)
    cached = ctx.cache_dir / f"{key}_{seg.index}.mp4"
    if cached.exists():
        return cached  # resumption after a spot interruption takes effect here (idempotent)
    out = await ctx.engine.sync(seg)  # heavy GPU work runs only for uncomputed segments
    out.replace(cached)  # commit atomically
    return cached
Even if the VM is killed by a spot eviction, after restart the job continues from the segment following the last completed one. Because engine and tuning are part of the key, changing either automatically produces a separate cache, so stale results are never reused by mistake. This is what a resumable, idempotent pipeline means in practice.
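The resume behavior can be checked with a toy run. The sketch below is an illustration under assumptions: `FakeEngine` stands in for the GPU service, the cache key is reduced to three fields, and a temporary directory plays the role of the persistent disk. It "interrupts" a five-segment job after three segments and then restarts it.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path


def cache_key(project_id: str, engine: str, tuning: str) -> str:
    raw = f"{project_id}|{engine}|{tuning}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


class FakeEngine:
    """Stands in for the GPU service and counts how often it is really invoked."""

    def __init__(self, work_dir: Path) -> None:
        self.work_dir = work_dir
        self.calls = 0

    async def sync(self, index: int) -> Path:
        self.calls += 1
        out = self.work_dir / f"tmp_{index}.mp4"
        out.write_bytes(b"frames")
        return out


async def lipsync_all(n: int, engine: FakeEngine, cache_dir: Path, key: str) -> list[Path]:
    done = []
    for i in range(n):
        cached = cache_dir / f"{key}_{i}.mp4"
        if not cached.exists():
            out = await engine.sync(i)
            out.replace(cached)  # rename within one filesystem, so no partial file is visible
        done.append(cached)
    return done


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    key = cache_key("proj-1", "latentsync", "steps=25,cfg=2.0")
    first = FakeEngine(root)
    asyncio.run(lipsync_all(3, first, root, key))   # job "interrupted" after 3 of 5
    second = FakeEngine(root)
    asyncio.run(lipsync_all(5, second, root, key))  # restart: only 2 segments remain
    calls_after_restart = second.calls
    retuned_key = cache_key("proj-1", "latentsync", "steps=30,cfg=2.0")
```

The restarted run invokes the engine only for the two missing segments, and a changed tuning string yields a different key, so old outputs are simply never looked up.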
Per-stage retry and "precisely failing" exception design
Transient failures (network drops, momentary memory pressure) are recovered by retrying, while permanent failures (invalid input, models outside their license) stop immediately. To make this distinction, I designed a hierarchy of 15 exception types.
class InfiniteBridgeError(Exception): ...


class ProcessingError(InfiniteBridgeError):
    def __init__(self, stage: str, message: str) -> None:
        self.stage = stage  # recorded in error_stage when the project goes to FAILED
        super().__init__(message)


class TranscriptionError(ProcessingError): ...
class TranslationError(ProcessingError): ...


class TTSGenerationError(ProcessingError):
    def __init__(self, segment_index: int, message: str) -> None:
        self.segment_index = segment_index  # which segment failed
        super().__init__("synthesize", message)


class LipSyncError(ProcessingError): ...
class GpuServiceError(InfiniteBridgeError): ...     # fault in an external GPU service
class TaskCancelledError(InfiniteBridgeError): ...  # user cancellation (not a failure)
class PathTraversalError(InfiniteBridgeError): ...  # defense at the storage boundary
Retry counts are tuned per stage, and for the external GPU services called over HTTP (TTS, lip-sync) I standardized on exponential backoff with tenacity. What matters is distinguishing cancellation from failure. TaskCancelledError does not move the project to FAILED; it quietly deregisters the task. The cancellation flag is held in Redis and checked cooperatively at each stage's natural checkpoints.
# keep a cancellation, which is not a failure, out of the failure path
if await is_cancelled(project_id):
    raise TaskCancelledError(project_id)
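The transient/permanent split with exponential backoff can be sketched without any dependency. Note that this is a standard-library stand-in written for illustration, not the tenacity-based code the product uses; `TransientError`, `PermanentError`, and `call_with_backoff` are hypothetical names, and the delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix (e.g. invalid input)."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only TransientError, doubling the wait after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable when max_attempts >= 1")


waits: list[float] = []  # records the delays instead of actually sleeping
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("connection reset")
    return "ok"


result = call_with_backoff(flaky, sleep=waits.append)
```

A `PermanentError` is not caught by the loop at all, so it propagates on the first attempt, which is the "stop immediately" half of the policy.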
The second wall: making GPU cost "cheap" by discarding silence cleverly
Observation: the mouth in a video is, most of the time, "not moving"
GPU inference is the most expensive resource in this product. Yet if you watch the video, most frames contain no speech at all: silent intervals, BGM-only cuts, a listener's reaction. A naive implementation nevertheless passes every frame through the diffusion model. That is economically broken, and it also degrades quality, because diffusion-model lip-sync has a known problem of "hallucinating" mouth movement on silent intervals and faceless frames.
In other words, keeping silence off the GPU achieves cost reduction and quality improvement at the same time.
Synthesizing speech segments: subtitle slots ∪ audio energy
I derive the intervals where someone is actually speaking from two sources: the union of the translated-subtitle slots (speech timing confirmed by a human) and the energy spans of the dubbed audio (intervals where sound is actually present). I add a small margin before and after each span (lead-in and lead-out for the mouth opening and closing), merge spans separated by only a short silence, and so fix the "sync windows" in which the GPU is used.
def plan_sync_windows(
    subtitle_slots: list[Span],
    audio_energy_spans: list[Span],
    margin: float = 0.4,     # padding before/after for the mouth opening and closing (seconds)
    merge_gap: float = 2.0,  # silences shorter than this are merged into one window
) -> list[Span]:
    spans = sorted(
        Span(s.start - margin, s.end + margin)
        for s in (*subtitle_slots, *audio_energy_spans)
    )
    merged: list[Span] = []
    for s in spans:
        if merged and s.start - merged[-1].end < merge_gap:
            merged[-1] = Span(merged[-1].start, max(merged[-1].end, s.end))
        else:
            merged.append(s)
    return merged  # only these intervals go to the GPU; the rest keeps the original video
Outside the sync windows, the original video frames are used as-is, so the mouth stays as natural as in the source and the GPU does not process a single frame. With this speech-segment detection, I reduced the GPU processing cost of lip-sync by about 40% while also eliminating the mouth artifacts during silence.
The essence of the design is not "optimization means making the processing faster" but "find the intervals that need no processing in the first place." The fastest GPU processing is the GPU processing you never run.
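To see how the window planning plays out, here is a self-contained run on made-up timings. The `Span` definition and every number in the example clip are assumptions for illustration; the merge logic mirrors the function shown earlier with its default margin and merge gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Span:
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def plan_sync_windows(subtitle_slots, audio_energy_spans, margin=0.4, merge_gap=2.0):
    spans = sorted(
        Span(s.start - margin, s.end + margin)
        for s in (*subtitle_slots, *audio_energy_spans)
    )
    merged: list[Span] = []
    for s in spans:
        if merged and s.start - merged[-1].end < merge_gap:
            merged[-1] = Span(merged[-1].start, max(merged[-1].end, s.end))
        else:
            merged.append(s)
    return merged


# Hypothetical 60-second clip: two lines close together, then one much later.
subtitles = [Span(2.0, 5.0), Span(6.0, 9.0), Span(40.0, 44.0)]
energy = [Span(2.2, 5.4), Span(40.1, 43.5)]

windows = plan_sync_windows(subtitles, energy)
gpu_seconds = sum(w.duration for w in windows)
gpu_fraction = gpu_seconds / 60.0
```

In this toy clip the five input spans collapse into two windows covering roughly a fifth of the running time. The actual saving depends entirely on how much of the real footage contains speech.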
Structural cost optimization: spot + smart auto-stop
In addition to the algorithmic reduction, I hold cost down structurally on the infrastructure side. Spot VMs sharply reduce the share of on-demand usage, and a smart auto-stop shuts the VM down safely based on task state. It is not a simple timer: it stops the VM only after checking Celery's active tasks, the queue, and user sessions, which prevents killing a video mid-processing while pushing idle GPU billing toward zero. The DevTest auto-stop configured through Terraform is layered on top as the final backstop, giving a double safety net.
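The stop decision itself can be expressed as a small pure function, which also makes it easy to test. This is only a sketch under assumptions: `WorkerSnapshot`, `should_stop_vm`, and the 10-minute idle threshold are hypothetical, and gathering the counts (from Celery, the queue, and the session store) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSnapshot:
    active_tasks: int    # tasks currently executing on the worker
    queued_tasks: int    # tasks still waiting in the queue
    user_sessions: int   # users who are currently active
    idle_seconds: float  # time since the last task finished


def should_stop_vm(snap: WorkerSnapshot, min_idle_seconds: float = 600.0) -> bool:
    """Stop only when nothing runs, nothing waits, and nobody is around."""
    if snap.active_tasks or snap.queued_tasks or snap.user_sessions:
        return False
    return snap.idle_seconds >= min_idle_seconds


busy = should_stop_vm(WorkerSnapshot(1, 0, 0, 3600.0))
waiting = should_stop_vm(WorkerSnapshot(0, 2, 0, 3600.0))
someone_editing = should_stop_vm(WorkerSnapshot(0, 0, 1, 3600.0))
just_finished = should_stop_vm(WorkerSnapshot(0, 0, 0, 30.0))
idle = should_stop_vm(WorkerSnapshot(0, 0, 0, 900.0))
```

Only the last snapshot permits a shutdown. A plain timer would have stopped the VM in the first three cases as well.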
The third wall: making dubbing "natural" with isochrony, a plain but essential problem
Translation translates "meaning" but doesn't translate "duration"
If you dub machine translation as-is, you are guaranteed to hit this problem: the source and target languages take different amounts of time to say the same thing. A 3-second English line becomes 5 seconds in Japanese, or vice versa. If you ignore this and pour the synthesized audio into the subtitle slot, it is either crammed in as rushed speech or runs long and intrudes into the next line. Before mouth synchronization even starts, the audio itself does not fit the slot.
The constraint of fitting an utterance into its allotted time is called isochrony. In this product, I combine four measures in priority order to absorb the duration gap.
- Borrow the following silent gap: if there's silence right after the slot, spill a little into it. But always leave a 0.15-second breath margin (the naturalness of taking a breath).
- Raise the speech rate (up to 1.2×): if borrowing isn't enough, speed it up within a range that doesn't harm perceptual quality (up to 20%).
- Time-stretch (Rubberband, up to 1.1×): conversely, audio that's too short is stretched up to 10% while keeping the pitch.
- If it still doesn't fit, reject it at a quality gate: if the per-segment TTS failure rate exceeds 20%, interrupt the processing and gracefully degrade with silence insertion.
def fit_to_slot(audio: AudioSegment, slot: Span, next_gap: float) -> AudioSegment:
    overflow = audio.duration - slot.duration
    if overflow <= 0:
        # too short: stretch toward the slot length, capped at 1.1x
        ratio = min(slot.duration / audio.duration, MAX_STRETCH)
        return rubberband_stretch(audio, ratio=ratio)
    borrowable = max(0.0, next_gap - BORROW_GUARD)  # always leave 0.15 s
    overflow -= min(overflow, borrowable)           # borrow from the gap first
    if overflow > 0:
        speedup = min(audio.duration / (audio.duration - overflow), MAX_SPEEDUP)
        audio = time_compress(audio, speedup)       # then raise the speech rate (capped at 1.2x)
    return audio
What is non-trivial is that the speech-rate ceiling (1.2×) and the stretch ceiling (1.1×) are asymmetric. Human perception is more sensitive to audio that is too slow than to audio that is too fast, and stretching artifacts are more noticeable. So I allow more room on the compression side and keep the stretching side tight. These parameters encode knowledge from psychoacoustics, and I believe that being able to explain the basis for each number is a mark of production quality.
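The arithmetic behind these limits is easy to work through on numbers alone. The sketch below plans a fit without touching any audio; `FitPlan` and `plan_fit` are hypothetical names, and the example durations are made up. Only the constants 0.15, 1.2, and 1.1 come from the text.

```python
from dataclasses import dataclass

BORROW_GUARD = 0.15  # seconds of breath margin always left in the next gap
MAX_SPEEDUP = 1.2    # speech-rate ceiling
MAX_STRETCH = 1.1    # time-stretch ceiling


@dataclass(frozen=True)
class FitPlan:
    borrowed: float        # seconds taken from the following silence
    speedup: float         # > 1.0 means the audio is played faster
    stretch: float         # > 1.0 means the audio is lengthened
    final_duration: float
    fits: bool             # False means the segment goes to the quality gate


def plan_fit(audio_s: float, slot_s: float, next_gap_s: float) -> FitPlan:
    if audio_s <= slot_s:
        stretch = min(slot_s / audio_s, MAX_STRETCH)
        return FitPlan(0.0, 1.0, stretch, audio_s * stretch, True)
    overflow = audio_s - slot_s
    borrowed = min(overflow, max(0.0, next_gap_s - BORROW_GUARD))
    target = slot_s + borrowed
    speedup = min(audio_s / target, MAX_SPEEDUP)
    final = audio_s / speedup
    return FitPlan(borrowed, speedup, 1.0, final, final <= target + 1e-9)


fits_by_borrowing = plan_fit(3.3, 3.0, 0.5)  # 0.3 s over: the gap absorbs it
fits_with_speedup = plan_fit(4.0, 3.0, 0.5)  # needs about 1.19x, under the cap
too_long = plan_fit(5.0, 3.0, 0.5)           # would need about 1.49x: does not fit
short_line = plan_fit(2.0, 3.0, 0.5)         # stretched by the 1.1x cap only
```

With a 3-second slot and a 0.5-second gap, the available time tops out at 3.35 seconds, so anything longer than about 4.02 seconds of synthesized audio cannot be rescued by speed alone and must be handled by the quality gate.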
The fourth wall: harden the diffusion model in production
The highest-quality lip-sync comes from a diffusion model (LatentSync v1.5), but the diffusion model is also the component that breaks most easily when research code is put into production as-is. I applied three hardening measures.
(1) Frame-rate normalization and alignment in 16-frame units
LatentSync internally assumes 25fps, and its UNet operates on chunks of 16 frames. Since input videos vary (24, 30, 60fps), I normalize the input to 25fps, align each segment to a multiple of 16 frames before processing, and convert the output back to the original fps. Skipping this truncates frames at chunk boundaries and causes "lip drift," where the mouth slips out of sync for an instant.
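The alignment step is plain integer arithmetic, sketched below. One assumption is labeled loudly: this version rounds up (pads) to the next multiple of 16 rather than trimming, and the text does not say which of the two the product does. The function names are hypothetical.

```python
import math

TARGET_FPS = 25  # frame rate LatentSync assumes, per the text
CHUNK = 16       # frames per UNet chunk, per the text


def frames_at_target_fps(duration_s: float, fps: int = TARGET_FPS) -> int:
    """Frame count of a segment after resampling to the target fps."""
    return round(duration_s * fps)


def aligned_frame_count(duration_s: float, chunk: int = CHUNK) -> int:
    """Frame count rounded up to a whole number of chunks (padding assumed)."""
    frames = frames_at_target_fps(duration_s)
    return math.ceil(frames / chunk) * chunk


def padding_frames(duration_s: float) -> int:
    return aligned_frame_count(duration_s) - frames_at_target_fps(duration_s)


thirty_seconds = aligned_frame_count(30.0)  # 750 frames become 752
pad_30 = padding_frames(30.0)
pad_10 = padding_frames(10.0)               # 250 frames become 256
```

A 30-second window at 25fps is 750 frames, which is not a multiple of 16, so without explicit alignment the last partial chunk of 14 frames is exactly where truncation would occur.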
(2) Avoid OOM "by design": the physical upper limit of a 30-second window
The diffusion pipeline decodes the entire inference window into host RAM. At 720p, a 60-second window was enough to exceed the 22GB RAM limit and get OOM-killed. "Adding more memory" is the losing move here: a spot VM's RAM is finite, and more RAM comes back as cost. Instead, I fixed the segment window at 30 seconds, building into the design a physical guarantee that peak RAM stays under the limit. MuseTalk (a lighter latent diffusion) is fine with a 120-second window, so the window size is set per engine.
| Engine | Method | Segment window | fps normalization | Positioning |
|---|---|---|---|---|
| MuseTalk | latent diffusion (fast) | 120 sec | not needed | speed-first, standard quality |
| LatentSync v1.5 | diffusion (high quality) | 30 sec | to 25fps | quality-first, high mouth precision |
| VideoReTalking | GAN-based | — | — | retired pending license verification |
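A back-of-envelope estimate shows why window length is the right lever for host RAM. The assumptions here are mine, not the article's: frames are held as float32 RGB, and only a single decoded copy is counted. A real pipeline keeps additional copies and intermediates, so this is a lower bound on the trend and not a reproduction of the measured 22GB figure.

```python
def decoded_window_bytes(
    width: int,
    height: int,
    fps: int,
    seconds: float,
    channels: int = 3,           # RGB (assumption)
    bytes_per_channel: int = 4,  # float32 (assumption)
) -> int:
    """Bytes needed to hold one fully decoded copy of an inference window."""
    frames = round(fps * seconds)
    return width * height * channels * bytes_per_channel * frames


GIB = 1024 ** 3
ram_60s = decoded_window_bytes(1280, 720, 25, 60)
ram_30s = decoded_window_bytes(1280, 720, 25, 30)
ram_60s_gib = ram_60s / GIB  # roughly 15.4 GiB for one copy
ram_30s_gib = ram_30s / GIB  # roughly 7.7 GiB for one copy
```

Memory grows linearly with the window, so halving the window halves the peak regardless of how many copies the pipeline actually holds. That linearity is what turns a fixed 30-second window into a hard upper bound.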
Engines and tuning are presented to users as four abstract presets. Internally, each preset resolves to a bundle of engine, number of inference steps, and guidance strength.
| Preset | Engine | Inference steps | Guidance | Feel |
|---|---|---|---|---|
| FAST | MuseTalk(256) | — | — | fastest, standard |
| BALANCED | MuseTalk(384, tuned) | — | — | the middle ground of speed and quality |
| HIGH_QUALITY | LatentSync 1.5 | 25 | 2.0 | high quality |
| ULTRA | LatentSync 1.5 | 30 | 2.0 | highest quality |
(3) Isolate only "faceless frames" with binary search
This is the biggest hurdle. If even one frame in the window has no face (a slide transition, a logo screen, a cut), the diffusion model hard-fails inference for the entire window. The earlier naive implementation then fell back to "dub only (no mouth synchronization)" for the whole 30-second window. One faceless frame threw away 29 seconds of good sync.
So when a window fails to sync, I bisect it, recursively narrow down the failing half, and isolate the faceless frame down to a lower limit of 3 seconds.
def sync_with_bisect(window: Span, ctx: LipSyncContext) -> list[Result]:
    try:
        return [ctx.engine.sync(window)]  # first try the whole window
    except LipSyncError:
        if window.duration <= BISECT_MIN:  # 3-second lower limit
            return [dub_only(window)]  # fall back to dub-only just for this piece
        mid = window.start + window.duration / 2
        return [
            *sync_with_bisect(Span(window.start, mid), ctx),
            *sync_with_bisect(Span(mid, window.end), ctx),  # halve the range holding the failure
        ]
As a result, the blast radius of the fallback shrank from "the entire window" to "the smallest interval that actually has no face." This applies a classic availability pattern (localize the fallback) to the failure characteristics of a diffusion model.
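A toy run makes the effect measurable. In this sketch the engine is faked: `fake_sync` fails whenever the window contains a single hypothetical faceless instant at t = 10.2 s. Everything except the bisect logic and the 3-second limit is an assumption for illustration.

```python
from dataclasses import dataclass

BISECT_MIN = 3.0  # seconds; below this, fall back to dub-only


@dataclass(frozen=True)
class Span:
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


class LipSyncError(Exception):
    pass


def fake_sync(window: Span, faceless_at: float) -> tuple[str, Span]:
    """Stand-in engine: hard-fails if the window contains the faceless instant."""
    if window.start <= faceless_at < window.end:
        raise LipSyncError("no face detected")
    return ("sync", window)


def sync_with_bisect(window: Span, faceless_at: float) -> list[tuple[str, Span]]:
    try:
        return [fake_sync(window, faceless_at)]
    except LipSyncError:
        if window.duration <= BISECT_MIN:
            return [("dub_only", window)]
        mid = window.start + window.duration / 2
        return [
            *sync_with_bisect(Span(window.start, mid), faceless_at),
            *sync_with_bisect(Span(mid, window.end), faceless_at),
        ]


results = sync_with_bisect(Span(0.0, 30.0), faceless_at=10.2)
dub_only_seconds = sum(w.duration for kind, w in results if kind == "dub_only")
synced_seconds = sum(w.duration for kind, w in results if kind == "sync")
```

For this input the 30-second window ends up as four synced pieces plus one dub-only piece of 1.875 seconds (9.375 to 11.25), so 28.125 seconds keep their lip-sync. The price is extra failed inference attempts along the failing branch, which is worth noting when estimating GPU time.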
The discipline running through all stages: type safety and 100% coverage
All the measures so far lose their value the moment they regress. If the spot-resumption cache key is missing one element, stale results get reused; if one of the isochrony limits changes, dubbing breaks down. So this product enforces quality gates without compromise.
- The backend mandates 100% test coverage. Falling short in CI is a build failure. All external I/O (GPU services, storage, DB) is mocked, verifying the logic quickly and deterministically.
- mypy strict, Ruff, and Vulture keep types, static analysis, and dead code at zero errors. The any type and print() are structurally prohibited.
- The frontend validates at the boundary with Zod. It adopts Next.js 16 / React 19 (Compiler enabled) / Mantine / TanStack Query, and with the API response's Zod schema as the single source of truth, separates server state (TanStack Query) and UI state (Zustand). It doesn't bring in any, and eliminates dead code with ESLint and Knip.
100% coverage isn't "numbers for the sake of numbers." It is proof that every external dependency can be mocked, which means the dependencies are correctly abstracted; it is the other side of the coin from the plugin-style architecture described earlier. Testability emerges as a result of good design.
Summary: production quality means fully designing the "way of breaking"
In fully automated AI video localization, connecting the models is 20% of the work. The remaining 80% (the "don't crash, cheaply, naturally, localize the way it breaks" covered in this article) is the line that separates a demo from production. Here are the key points once more.
- Drive the AI engine out into implementation detail. With interface → provider → factory + lazy load, localize the impact of model updates.
- Design on the premise of spot interruption. With segment splitting + idempotent cache keys, make it a resumable, idempotent pipeline.
- The fastest processing is processing you don't do. With speech-segment detection, don't pass silence through the GPU, improving cost and quality simultaneously (about 40% reduction).
- Control isochrony with numbers. Parameterize gap borrowing, the speech-rate upper limit, and time-stretching, together with the basis in perceptual quality.
- Harden the diffusion model in production. fps normalization, OOM-avoiding window design, binary-search fallback localization.
- Don't compromise on quality gates. Type safety and 100% coverage prevent the regression of all the above.
None of this is flashy. But whether you can fully design the unflashy parts is what raises an AI product from a "demo" to a "business." I handled everything single-handedly, from requirements definition to GPU infrastructure and production operation, and the evaluation for this project earned me #1 in CrowdWorks' overall weekly contract ranking for the engineer division. I believe that is the result of valuing this unglamorous work.
If you face a challenge in productionizing AI video, a GPU pipeline, or a diffusion model, I can consult end-to-end from design to infrastructure. Please see the specific track record from the link below.