JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
January 29, 2026
Authors: Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or
cs.AI
Abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
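The abstract describes two core ideas: a lightweight LoRA that adapts a frozen audio-video diffusion backbone to condition on the source clip while denoising the translated audio and matching facial motion, and a self-generated corpus of paired multilingual clips for training it. The sketch below is a minimal illustration of the first idea only, written in plain PyTorch under stated assumptions: the toy backbone, tensor shapes, additive conditioning, and flow-matching-style loss are all hypothetical and are not the paper's actual architecture or training objective.

```python
# Hypothetical sketch (not the paper's code): wrap the linear projections of a
# placeholder audio-video denoiser with LoRA adapters, then run one conditional
# training step on a paired (source-language, translated) latent clip.
# Every module name, shape, and objective here is an illustrative assumption.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # foundation weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))


class ToyAVDenoiser(nn.Module):
    """Stand-in for a joint audio-video diffusion backbone (illustrative only)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.dim = dim
        self.audio_proj = nn.Linear(dim, dim)
        self.video_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, 2 * dim)

    def forward(self, noisy_latents, condition_latents):
        # Condition on the source-language clip by simple feature addition here;
        # a real model would attend over the source audio-video tokens instead.
        d = self.dim
        a = self.audio_proj(noisy_latents[..., :d] + condition_latents[..., :d])
        v = self.video_proj(noisy_latents[..., d:] + condition_latents[..., d:])
        return self.out(torch.cat([a, v], dim=-1))


def add_lora(model: nn.Module, rank: int = 8) -> nn.Module:
    """Replace every nn.Linear in the backbone with a LoRA-wrapped copy."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank))
        else:
            add_lora(child, rank)
    return model


model = add_lora(ToyAVDenoiser())
opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One toy step: source-language latents as condition, translated clip as target.
cond = torch.randn(4, 16, 128)        # source audio-video latents (batch, time, feat)
target = torch.randn(4, 16, 128)      # paired translated audio + inpainted face latents
noise = torch.randn_like(target)
t = torch.rand(4, 1, 1)
noisy = (1 - t) * target + t * noise  # simple interpolation-style forward process
pred = model(noisy, cond)
loss = torch.nn.functional.mse_loss(pred, noise - target)  # velocity-style target
loss.backward()
opt.step()
```

Note that only the low-rank `down`/`up` matrices receive gradients; the frozen backbone mirrors the abstract's point that the foundation model's generative prior is kept intact while a small adapter learns the dubbing-specific conditioning.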