An AI video-localization and lip-sync platform
A GPU-inference pipeline that takes a single video upload and runs audio separation → transcription → translation → multilingual dubbing → lip-sync end to end | The evaluation of this project earned #1 on the CrowdWorks weekly contract ranking
Client
An AI video-localization SaaS for a marketing-support company | Domain: multilingual video expansion (audio separation, transcription, machine translation, voice-clone dubbing, diffusion-model lip-sync) | Setup: solo, from pipeline design through GPU infrastructure and production operations
My role
AI systems architect and full-stack developer (solo across pipeline design, frontend, backend, and GPU infrastructure).
Challenge (Situation & Task)
The traditional way to expand one video into many languages — "translate → record dubbing → swap the mouth" by hand — costs hundreds of thousands of yen and weeks per language. A production-grade, practical-quality AI pipeline was needed that, from just a video upload, runs audio separation, transcription, translation, multilingual dubbing, and lip-sync end to end.
Video localization is a domain where hard problems chain in series.
- Completing long GPU jobs: heavy GPU processing (audio separation → transcription → translation → speech synthesis → lip-sync) had to finish without breaking, even on tens-of-minutes videos. Adopting spot GPUs to cut cost means the cloud can forcibly stop them without notice, so resumability from interruption is a precondition.
- Natural dubbing (isochrony): dubbing machine translation as-is makes utterance length differ greatly between source and target (e.g. EN→JA changes timing a lot), so mouth and audio drift. Without controlling speaking rate and silence, quality breaks down into rushed or dragging speech.
- Diffusion-model-specific failures: diffusion-based lip-sync has a known problem of "hallucinating" mouths on frames where no face appears or during silence; applying it naively to every frame harms quality.
- GPU inference cost: GPU inference is expensive, and dutifully processing even silent stretches doesn't pencil out. How to cut wasted GPU time while preserving quality determined viability.
Why these technologies (Rationale)
FastAPI + Celery + Redis: decouple tens-of-minutes GPU jobs from the HTTP request and control progress, cancellation, and retries with async workers. Workers run a threads pool with concurrency 1 to serialize the GPU, structurally avoiding contention and VRAM exhaustion.
Plugin architecture (interface → provider → factory): make all 6 layers — audio separation, transcription, translation, speech synthesis, lip-sync, plus storage — swappable via environment variables alone. Heavy ML libraries are lazy-loaded to localize the impact of model updates.
Commercially usable open quantized models: transcription with faster-whisper (large-v3 / CTranslate2 / int8–float16 on GPU), translation with vLLM (Qwen3-8B-AWQ, Triton attention on T4) and 4-bit-quantized Llama-3, dubbing with voice-clone-capable TTS, and lip-sync with MuseTalk (latent diffusion) and LatentSync v1.5. License, quality, and speed trade-offs are optimized stage by stage.
Azure spot GPU + Terraform (IaC): Tesla T4 spot VMs (Standard_NC8as_T4_v3) sharply compress GPU cost while codifying VNet/NSG, GPU, storage, and auto-shutdown. The persistent data disk survives VM re-creation via prevent_destroy to protect the resume cache.
PostgreSQL + SQLAlchemy (async): projects, subtitles, translations, synthesized audio, and output videos are modeled relationally, strictly tracking each stage's state and error stage/message so jobs can resume.
What I did (Action)
[Resumable long jobs] Videos are split into segments aligned to speech (MuseTalk: 120-second windows / LatentSync: 30-second windows), and each segment's output is cached on the persistent disk. Even if a spot GPU is forcibly stopped, a cache key derived from project, audio, engine, tuning, and speech segments matches completed segments and resumes from there.
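A minimal sketch of how such a cache key could be derived, assuming the key is a hash over a canonical JSON payload. The function names, field names, and the choice of SHA-256 are illustrative assumptions, not the project's actual code.

```python
import hashlib
import json


def segment_cache_key(project_id, audio_sha256, engine, tuning, start_s, end_s):
    """Derive a deterministic key from everything that affects the output.

    All field names here are hypothetical. `tuning` is assumed to be a
    JSON-serializable dict of engine parameters.
    """
    payload = {
        "project": project_id,
        "audio": audio_sha256,
        "engine": engine,
        "tuning": tuning,
        # Round so float noise in segment bounds does not change the key.
        "segment": [round(start_s, 3), round(end_s, 3)],
    }
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pending_segments(segments, done_keys, key_for):
    """Return only the segments whose key is not already in the cache."""
    return [seg for seg in segments if key_for(seg) not in done_keys]
```

After a forced stop, a restarted worker would list the keys already present on the persistent disk, pass them as `done_keys`, and process only what `pending_segments` returns.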
[GPU savings via speech-segment detection] Combining dubbed-audio energy with subtitle segments to extract only "actually speaking" stretches (0.4-second margins front and back; silences under 2 seconds merged into the same window), silent stretches preserve the original footage without GPU. This cuts GPU processing by ~40% and eliminates mouth hallucinations during silence.
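The window-building step can be sketched as follows. This is a simplified illustration that starts from subtitle segments only and omits the audio-energy check. It also assumes the 2-second silence threshold is measured after the 0.4-second margins are applied, which the description above does not specify. The function names are hypothetical.

```python
def speech_windows(segments, duration_s, margin_s=0.4, max_gap_s=2.0):
    """Pad each (start, end) segment and merge neighbors split by short silence."""
    windows = []
    for start, end in sorted(segments):
        start = max(0.0, start - margin_s)       # margin before speech
        end = min(duration_s, end + margin_s)    # margin after speech
        if windows and start - windows[-1][1] < max_gap_s:
            # Silence is shorter than the threshold: extend the previous window.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [(s, e) for s, e in windows]


def gpu_fraction(windows, duration_s):
    """Share of the video that would still be sent to the GPU."""
    return sum(e - s for s, e in windows) / duration_s
```

Everything outside the returned windows would be copied from the original footage, so `1 - gpu_fraction(...)` approximates the GPU time saved for a given video.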
[Isochrony (lip-sync) control] The duration gap between translation and source is absorbed by borrowing silent gaps (0.15-second breath margin), a 1.2× speaking-rate cap, and Rubberband time-stretch capped at 1.1×. A quantitative gate aborts when a segment's TTS failure rate exceeds 20%, degrading gracefully by inserting silence.
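The arithmetic behind these limits can be sketched as below. The order of operations (borrow silence first, then raise the speaking rate, then time-stretch) and the reading of both caps as speed-up multipliers are assumptions made for illustration. `FitPlan` and `plan_fit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitPlan:
    rate: float      # TTS speaking-rate multiplier (1.0 = natural)
    stretch: float   # post-synthesis tempo multiplier (1.0 = untouched)
    final_s: float   # resulting clip duration in seconds
    fits: bool       # True when the clip fits the available time budget


def plan_fit(slot_s, tts_s, gap_after_s,
             breath_s=0.15, max_rate=1.2, max_stretch=1.1):
    """Fit a dubbed clip of tts_s seconds into a source slot of slot_s seconds."""
    if slot_s <= 0 or tts_s <= 0:
        raise ValueError("durations must be positive")
    # 1. Borrow the following silence, keeping a breath margin before the
    #    next utterance.
    budget = slot_s + max(0.0, gap_after_s - breath_s)
    # 2. Speed up synthesis, but never beyond the speaking-rate cap.
    rate = min(max_rate, max(1.0, tts_s / budget))
    after_rate = tts_s / rate
    # 3. Time-stretch what is left, again within its cap.
    stretch = min(max_stretch, max(1.0, after_rate / budget))
    final_s = after_rate / stretch
    return FitPlan(rate, stretch, final_s, final_s <= budget + 1e-6)
```

For example, a 3.2-second dub for a 2.0-second slot followed by 0.65 seconds of silence gets a 2.5-second budget, hits the 1.2× rate cap, and needs roughly a 1.07× stretch to fit. A clip that still does not fit after both caps comes back with `fits=False`, leaving the caller to decide how to degrade.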
[Production hardening of diffusion models] LatentSync normalizes input to 25fps and processes in 16-frame units, capped at 30-second windows so host RAM stays under 22GB. On lip-sync failure it binary-searches the window down to a 3-second floor, cutting out only face-less frames so the rest of the segment syncs successfully (avoiding the conventional fallback of dubbing the entire window).
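The failure-isolation idea can be sketched as a recursive bisection. This is an illustrative reading rather than the production code: it assumes a window is given up once halving it would go below the 3-second floor, and `try_sync` is a hypothetical callback that returns True when lip-sync succeeds on a window.

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]


def sync_with_bisect(window: Window,
                     try_sync: Callable[[Window], bool],
                     floor_s: float = 3.0) -> Tuple[List[Window], List[Window]]:
    """Return (synced, skipped) sub-windows of `window`.

    A failing window is halved repeatedly so that only the smallest piece
    that still fails is skipped.
    """
    start, end = window
    if try_sync(window):
        return [window], []
    if (end - start) / 2 < floor_s:
        # Halves would fall under the floor: stop and skip this piece.
        return [], [window]
    mid = (start + end) / 2
    left_ok, left_bad = sync_with_bisect((start, mid), try_sync, floor_s)
    right_ok, right_bad = sync_with_bisect((mid, end), try_sync, floor_s)
    return left_ok + right_ok, left_bad + right_bad
```

With a 24-second window in which only the stretch around second 10 has no face, this sketch syncs (0, 6), (6, 9), and (12, 24) and skips only (9, 12), instead of falling back for all 24 seconds.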
[Robustness & observability] Each stage has stage-specific retries and fallbacks (on lip-sync failure, output dubbed audio only), precise error classification via 15 exception types with sanitization, and cooperative cancellation via a Redis flag. The GPU service is monitored with health checks behind Caddy and structured logs.
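Cooperative cancellation can be sketched as a flag checked between segments. To stay self-contained, the example uses an in-memory flag store; a Redis-backed store exposing the same `is_set` method is assumed to take its place in production. All class and function names are hypothetical.

```python
from typing import Callable, Iterable, List, Protocol, Set, TypeVar

S = TypeVar("S")
R = TypeVar("R")


class CancelFlags(Protocol):
    """Anything that can answer whether a job has been asked to stop."""

    def is_set(self, job_id: str) -> bool: ...


class InMemoryFlags:
    """Stand-in for a Redis-backed flag store, for illustration only."""

    def __init__(self) -> None:
        self._ids: Set[str] = set()

    def set(self, job_id: str) -> None:
        self._ids.add(job_id)

    def is_set(self, job_id: str) -> bool:
        return job_id in self._ids


class JobCancelled(Exception):
    """Raised by the worker itself when it notices the cancel flag."""


def run_segments(job_id: str, segments: Iterable[S],
                 process: Callable[[S], R], flags: CancelFlags) -> List[R]:
    """Process segments in order, checking the flag before each one."""
    done: List[R] = []
    for seg in segments:
        if flags.is_set(job_id):
            # Stop at a segment boundary so completed work stays valid.
            raise JobCancelled(job_id)
        done.append(process(seg))
    return done
```

Because the check happens only at segment boundaries, a cancelled job never leaves a half-written segment behind; the cost is that cancellation takes effect after the current segment finishes.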
[Quality gates] The backend requires 100% test coverage (CI fails if unmet). mypy strict, Ruff, and Vulture keep types and static analysis at zero errors, and the frontend ensures type safety with React Compiler, ESLint, Knip, and Zod.
The crux of this product was a design that raises expensive, unstable GPU processing to "production-grade" quality.
Reliability (a resumable, idempotent pipeline): Every stage is an async task, with each stage's state and failure point managed in the DB. Long videos are segmented along speech and cached individually, so even if a spot GPU is interrupted or the network drops, processing resumes from the last completed segment. Because the cache key is derived from the input, engine, and tuning, re-running under the same conditions idempotently reuses existing results.
Quality (silence-skipping + isochrony + binary search): Extracting only speech segments from the dubbed audio and subtitle segments — and using the original footage for silence — prevents the diffusion model's mouth hallucinations while cutting GPU time by ~40%. The duration gap between translation and source is absorbed by borrowing silent gaps, a speaking-rate cap (1.2×), and time-stretch (1.1× cap). Failed windows are binary-searched down to a 3-second floor, separating only face-less frames to rescue the surrounding sync.
Maintainability (plugin architecture + lazy loading): The 6 layers — audio separation, transcription, translation, speech synthesis, lip-sync, and storage — are hidden behind a common interface and swappable via environment variables. Heavy ML libraries are lazy-loaded inside the factory to minimize startup cost and coupling.
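The interface → provider → factory shape could look roughly like this for one layer. It is a sketch under assumptions: the `Translator` interface, the `TRANSLATION_PROVIDER` variable name, and the `EchoTranslator` stand-in are all hypothetical and chosen only to keep the example runnable without any ML library.

```python
import os
from typing import Callable, Dict, Protocol


class Translator(Protocol):
    """Interface that every translation provider is assumed to satisfy."""

    def translate(self, text: str, target_lang: str) -> str: ...


class EchoTranslator:
    """Stand-in provider; a real one would wrap a model."""

    def translate(self, text: str, target_lang: str) -> str:
        return f"[{target_lang}] {text}"


def _load_echo() -> Translator:
    # A real loader would place its heavy `import` statement here, inside
    # the function, so unselected providers never import their ML libraries.
    return EchoTranslator()


# Hypothetical registry: provider name -> zero-argument loader.
PROVIDERS: Dict[str, Callable[[], Translator]] = {
    "echo": _load_echo,
}


def create_translator(env_var: str = "TRANSLATION_PROVIDER") -> Translator:
    """Factory: build the provider named by an environment variable."""
    name = os.environ.get(env_var, "echo")
    try:
        loader = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider {name!r}") from None
    return loader()
```

Callers depend only on `Translator`, so switching engines is a matter of registering another loader and changing the environment variable.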
Security & cost: Azure authentication is keyless via managed identity (no shared keys are issued), NSGs deny by default and allow only explicitly listed traffic, and storage implements path-traversal protection. Spot VMs plus a smart auto-shutdown that watches task state structurally curb idle GPU billing.
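One common way to implement path-traversal protection for a local storage provider is to resolve the requested path and check that it stays under the storage root. The sketch below shows that general technique, not necessarily the project's implementation; `resolve_inside` is a hypothetical helper and requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_inside(root: Path, user_path: str) -> Path:
    """Resolve user_path under root, rejecting anything that escapes it."""
    root = root.resolve()
    # resolve() collapses ".." and follows symlinks; an absolute user_path
    # replaces root entirely, which the check below also catches.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes storage root: {user_path!r}")
    return candidate
```

Checking the resolved path rather than scanning the string for `..` also covers absolute paths and symlinks that point outside the root.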
Key technical decisions
Celery + Redis: making long GPU jobs async with progress, cooperative cancellation, and stage-specific retries
Segmentation + persistent cache: a resumable, idempotent design that withstands spot-GPU interruptions
Speech-segment detection (silence skipping): GPU-cost savings and suppression of diffusion-model mouth hallucinations
Interface → provider → factory: making the 6 AI-engine layers swappable to localize future updates
Responsibilities
- AI pipeline design (orchestrating audio separation, STT, translation, TTS, lip-sync)
- Frontend development (Next.js 16 / React 19 / Mantine / TanStack Query / Zod)
- Backend development (FastAPI / Python / Celery / SQLAlchemy async)
- GPU infrastructure build (Azure spot T4 / Terraform IaC / Caddy)
- Quality assurance (100% test coverage / mypy strict / Ruff / Vulture)
Technologies
Results in numbers
- Languages supported: 8 languages. Translation, multilingual dubbing, and subtitles.
- GPU cost reduction: 40%. Reduced by processing only speech segments (silence skipping).
- Backend test coverage: 100%. CI fails below 100%.
- Swappable AI-engine layers: 6 layers. Separation / STT / translation / TTS / lip-sync / storage.
Results
- The evaluation of this project earned #1 on the CrowdWorks weekly contract ranking, in both the Engineer division and Overall
- Fully automated from video upload to multilingual dubbed-video output (supporting 8 languages: Japanese, English, Chinese, Korean, Spanish, French, German, Portuguese)
- Cut lip-sync GPU processing cost by ~40% by sending only speech segments to the GPU
- Long videos run to completion on a resumable, idempotent segment pipeline that withstands forced spot-GPU stops
- Made the 6 layers (audio separation, transcription, translation, speech synthesis, lip-sync, and storage) swappable via environment variables alone, localizing the impact of model updates
- Maintained zero errors with 100% backend test coverage, mypy strict, Ruff, and Vulture to ensure production quality