Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

November 26, 2025
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
cs.AI

Abstract

The synthesis of synchronized audio-visual content is a key challenge in generative AI, and open-source models still struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental problems of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these problems, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm that mitigates drift by leveraging strong supervisory signals from audio-driven video generation and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state of the art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
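
For readers unfamiliar with the conventional Classifier-Free Guidance that the abstract contrasts with SyncCFG, the sketch below shows the standard combination rule only: the guided prediction extrapolates from the unconditional prediction toward the conditional one. The abstract does not give the SyncCFG formulation, so none is attempted here. The function name `cfg_combine`, the list-based vectors, and the numeric values are illustrative assumptions, not taken from the paper.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Conventional classifier-free guidance (illustrative sketch).

    pred_uncond:    model prediction without the condition (hypothetical values)
    pred_cond:      model prediction with the condition (hypothetical values)
    guidance_scale: 0 returns the unconditional prediction, 1 returns the
                    conditional one, values above 1 extrapolate past it
    """
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Toy three-dimensional "noise predictions", chosen only for illustration.
pred_uncond = [0.0, 1.0, 2.0]
pred_cond = [1.0, 1.0, 0.0]

guided = cfg_combine(pred_uncond, pred_cond, guidance_scale=2.0)
print(guided)
```

The guidance term here is simply the difference between a conditional and an unconditional prediction of the same modality. That is consistent with the abstract's remark that conventional CFG strengthens conditioning without specifically targeting cross-modal synchronization.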