Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
December 2, 2025
Authors: Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang
cs.AI
Abstract
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver benefits beyond synchrony. We observe consistent video-quality improvements on challenging subsets featuring large object motions and object-contact interactions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., how collision timing shapes the resulting sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
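To make the joint-denoising setup concrete, below is a minimal sketch of how two pre-trained diffusion backbones could be trained to denoise video and audio latents at a shared timestep under a combined objective. The abstract does not specify AVFullDiT's internals, so the backbone interfaces, the cross-modal fusion (mean-pooled linear projections standing in for the model's joint attention), the scheduler API, and the unweighted loss sum are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of audio-video joint denoising with pre-trained
# T2V and T2A backbones; interfaces and fusion strategy are assumptions.
import torch
import torch.nn as nn


class JointAVDenoiser(nn.Module):
    def __init__(self, t2v_backbone: nn.Module, t2a_backbone: nn.Module,
                 dim_v: int, dim_a: int):
        super().__init__()
        self.t2v = t2v_backbone  # pre-trained text-to-video DiT (assumed interface)
        self.t2a = t2a_backbone  # pre-trained text-to-audio DiT (assumed interface)
        # Lightweight cross-modal projections: the only newly trained
        # parameters in this parameter-efficient sketch.
        self.a2v = nn.Linear(dim_a, dim_v)  # audio summary -> video space
        self.v2a = nn.Linear(dim_v, dim_a)  # video summary -> audio space

    def forward(self, noisy_video, noisy_audio, t, text_emb):
        # Each backbone processes its own noisy latents, conditioned on
        # the text embedding and the shared diffusion timestep.
        h_v = self.t2v(noisy_video, t, text_emb)  # (B, Nv, dim_v) video tokens
        h_a = self.t2a(noisy_audio, t, text_emb)  # (B, Na, dim_a) audio tokens
        # Exchange pooled summaries between modalities (a simple stand-in
        # for full joint attention, which the abstract does not detail).
        v_ctx = self.v2a(h_v.mean(dim=1, keepdim=True))
        a_ctx = self.a2v(h_a.mean(dim=1, keepdim=True))
        # Predicted noise for video and audio, each informed by the other.
        return h_v + a_ctx, h_a + v_ctx


def joint_denoising_loss(model, video_lat, audio_lat, text_emb, scheduler):
    """One joint training step: noise both modalities at a shared timestep
    and regress the noise for each (scheduler.add_noise is a hypothetical API)."""
    t = torch.randint(0, scheduler.num_steps, (video_lat.size(0),),
                      device=video_lat.device)
    eps_v, eps_a = torch.randn_like(video_lat), torch.randn_like(audio_lat)
    noisy_v = scheduler.add_noise(video_lat, eps_v, t)
    noisy_a = scheduler.add_noise(audio_lat, eps_a, t)
    pred_v, pred_a = model(noisy_v, noisy_a, t, text_emb)
    # Even if only video quality matters at inference time, the audio term
    # supplies the auxiliary ("privileged") learning signal.
    return nn.functional.mse_loss(pred_v, eps_v) + nn.functional.mse_loss(pred_a, eps_a)
```

The T2V-only baseline in the paper's comparison would correspond to dropping the audio branch and its loss term while keeping everything else identical, which is what isolates the contribution of joint denoising.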