UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
November 5, 2025
Authors: Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang
cs.AI
Abstract
Due to the lack of effective cross-modal modeling, existing open-source
audio-video generation methods often exhibit compromised lip synchronization
and insufficient semantic consistency. To mitigate these drawbacks, we propose
UniAVGen, a unified framework for joint audio and video generation. UniAVGen is
anchored in a dual-branch joint synthesis architecture, incorporating two
parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent
space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which
enables bidirectional, temporally aligned cross-attention, thus ensuring
precise spatiotemporal synchronization and semantic consistency. Furthermore,
this cross-modal interaction is augmented by a Face-Aware Modulation module,
which dynamically prioritizes salient regions in the interaction process. To
enhance generative fidelity during inference, we additionally introduce
Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly
amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint
synthesis design enables seamless unification of pivotal audio-video tasks
within a single model, such as joint audio-video generation and continuation,
video-to-audio dubbing, and audio-driven video synthesis. Comprehensive
experiments validate that, with far fewer training samples (1.3M vs. 30.1M),
UniAVGen delivers overall advantages in audio-video synchronization, timbre
consistency, and emotion consistency.
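
The abstract names two mechanisms that a small example may make more concrete: temporally aligned cross-attention and Modality-Aware Classifier-Free Guidance. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation. The function names, the `window` parameter, the index-based mapping from audio tokens to video frames, and the choice of a "no cross-modal interaction" prediction as the guidance baseline are all hypothetical, introduced only for illustration. Only one attention direction (audio attending to video) is shown.

```python
import math


def aligned_mask(n_audio, n_frames, window=0):
    """Hypothetical helper: mask[i][j] is True when audio token i may
    attend to video frame j, i.e. when j lies within `window` frames of
    the frame that covers the same moment in time."""
    mask = []
    for i in range(n_audio):
        # Assumption: audio tokens and video frames span the same duration,
        # so a proportional index mapping gives the time-aligned frame.
        center = i * n_frames // n_audio
        mask.append([abs(j - center) <= window for j in range(n_frames)])
    return mask


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention restricted by a boolean mask."""
    dim = len(values[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        scale = math.sqrt(len(q))
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if ok else -math.inf
            for k, ok in zip(keys, allowed)
        ]
        top = max(scores)  # finite: every row allows at least one frame
        weights = [math.exp(s - top) for s in scores]  # masked entries become 0.0
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total
             for d in range(dim)]
        )
    return outputs


def modality_aware_guidance(pred_joint, pred_no_cross, scale):
    """Classifier-free-guidance-style extrapolation. Assumption: the
    baseline is the prediction made without cross-modal interaction, so
    scale > 1 amplifies whatever the other modality contributed."""
    return [u + scale * (c - u) for c, u in zip(pred_joint, pred_no_cross)]
```

With `window=0`, each audio token reads only from its single time-aligned frame; a larger window would let it draw on neighboring frames as well. The guidance function reduces to the joint prediction when `scale` is 1.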