UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
November 5, 2025
Authors: Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang
cs.AI
Abstract
Due to the lack of effective cross-modal modeling, existing open-source
audio-video generation methods often exhibit compromised lip synchronization
and insufficient semantic consistency. To mitigate these drawbacks, we propose
UniAVGen, a unified framework for joint audio and video generation. UniAVGen is
anchored in a dual-branch joint synthesis architecture, incorporating two
parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent
space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which
enables bidirectional, temporally aligned cross-attention, thus ensuring
precise spatiotemporal synchronization and semantic consistency. Furthermore,
this cross-modal interaction is augmented by a Face-Aware Modulation module,
which dynamically prioritizes salient regions in the interaction process. To
enhance generative fidelity during inference, we additionally introduce
Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly
amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint
synthesis design enables seamless unification of pivotal audio-video tasks
within a single model, such as joint audio-video generation and continuation,
video-to-audio dubbing, and audio-driven video synthesis. Comprehensive
experiments validate that, with far fewer training samples (1.3M vs. 30.1M),
UniAVGen delivers overall advantages in audio-video synchronization, timbre
consistency, and emotion consistency.