JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion
January 29, 2026
Authors: Anthony Chen, Naomi Ken Korem, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or
cs.AI
Abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. By leveraging the rich generative prior of the audio-visual model, our approach preserves speaker identity and lip synchronization while remaining robust to complex motion and real-world dynamics. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
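To make the paired-data synthesis step concrete, below is a minimal Python-style sketch of the procedure the abstract describes: generate one clip containing a language switch, then inpaint each half's face and audio into the other half's language. Every object and method name (dub_model, generate, split, inpaint_face_and_audio) is a hypothetical placeholder for an audio-visual diffusion model interface, not an API from the paper.

```python
# Minimal sketch (not the authors' code) of the self-supervised pairing
# procedure described in the abstract. All objects and method names are
# hypothetical placeholders standing in for an audio-visual diffusion model.

def synthesize_training_pair(dub_model, prompt, lang_a, lang_b):
    """Return two clips of the same speaker, one spoken in each language."""
    # 1) Generate a single clip whose speech switches from lang_a to lang_b mid-clip.
    mixed = dub_model.generate(prompt, languages=[lang_a, lang_b])

    # 2) Split it into halves; each half is spoken in a different language.
    first_half, second_half = mixed.split(at=0.5)

    # 3) Inpaint the face and audio of each half to match the other half's
    #    language, yielding one full-length clip per language.
    clip_a = dub_model.inpaint_face_and_audio(mixed, target=second_half, language=lang_a)
    clip_b = dub_model.inpaint_face_and_audio(mixed, target=first_half, language=lang_b)

    # The pair shares speaker identity and scene dynamics, so it can supervise
    # a LoRA that conditions on one clip while generating the other.
    return clip_a, clip_b
```

The design choice this sketch illustrates is that the foundation model supplies its own supervision: no real multilingual recordings of the same speaker are needed, because both halves of the generated clip already share identity and motion.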