Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Abstract

Synthesizing synchronized audio-visual content is a key challenge in generative AI, and open-source models still struggle to achieve robust audio-video alignment. Our analysis traces this issue to three fundamental problems in the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which strengthens conditioning but does not improve cross-modal synchronization. To overcome these limitations, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. First, we propose a Cross-Task Synergy training paradigm that mitigates drift by leveraging strong supervisory signals from audio-driven video generation and video-driven audio generation tasks. Second, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state of the art, significantly outperforming existing methods in both generation fidelity and, critically, fine-grained audio-visual synchronization.
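
For readers unfamiliar with the baseline mentioned in point (3), the following is a minimal sketch of conventional Classifier-Free Guidance only. It is not SyncCFG, whose formulation the abstract does not give. The function name `cfg_combine` and the use of plain Python lists in place of model tensors are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Conventional CFG: extrapolate from the unconditional prediction
    toward the conditional one by a guidance scale.

    eps_cond   -- hypothetical noise prediction given the condition
    eps_uncond -- hypothetical noise prediction with the condition dropped
    scale      -- guidance scale (1.0 reproduces the conditional prediction)
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # Guidance direction is (conditional - unconditional), per element.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 1.0]
    print(cfg_combine(cond, uncond, 1.0))  # [1.0, 2.0]
    print(cfg_combine(cond, uncond, 3.0))  # [3.0, 4.0]
```

The guidance direction here is defined entirely by the condition that is dropped. If that condition is, for example, a text prompt, the amplified difference reflects prompt adherence within each modality and carries no explicit audio-video alignment term, which is consistent with the limitation the abstract attributes to conventional CFG.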