

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

December 2, 2025
Authors: Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang
cs.AI

Abstract

Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision timing affects the resulting sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
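
To make the idea of audio-video joint denoising concrete, the following is a minimal, illustrative PyTorch sketch of a training objective in which a video stream and an audio stream are denoised together and exchange information through cross-attention. The class names (`CrossModalBlock`, `JointDenoiser`), dimensions, the cross-attention fusion choice, and the simplified corruption schedule are assumptions made for illustration only; they are not the paper's actual AVFullDiT design or parameter-efficient adaptation scheme.

```python
# Illustrative sketch of audio-video joint denoising (NOT the paper's AVFullDiT).
# Assumptions: pre-tokenized video/audio latents, a shared hidden dim, and a
# simple linear noising schedule with a noise-prediction MSE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalBlock(nn.Module):
    """Per-modality self-attention followed by bidirectional cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.video_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.audio_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries audio
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries video

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        v = self.video_self(v)
        a = self.audio_self(a)
        # Cross-modal exchange: each stream attends to the other and adds the result.
        v = v + self.a2v(v, a, a)[0]
        a = a + self.v2a(a, v, v)[0]
        return v, a


class JointDenoiser(nn.Module):
    """Predicts the noise added to both the video and audio latents."""

    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(CrossModalBlock(dim) for _ in range(depth))
        self.video_out = nn.Linear(dim, dim)
        self.audio_out = nn.Linear(dim, dim)

    def forward(self, v_noisy: torch.Tensor, a_noisy: torch.Tensor):
        v, a = v_noisy, a_noisy
        for blk in self.blocks:
            v, a = blk(v, a)
        return self.video_out(v), self.audio_out(a)


def joint_denoising_loss(model: JointDenoiser, v_latent: torch.Tensor, a_latent: torch.Tensor):
    """Noise both latents at a shared timestep and sum the per-modality losses,
    so gradients from the audio branch also shape the shared video pathway."""
    noise_v = torch.randn_like(v_latent)
    noise_a = torch.randn_like(a_latent)
    t = torch.rand(v_latent.size(0), 1, 1)            # shared corruption level per sample
    v_noisy = (1 - t) * v_latent + t * noise_v        # simplified linear corruption
    a_noisy = (1 - t) * a_latent + t * noise_a
    pred_v, pred_a = model(v_noisy, a_noisy)
    return F.mse_loss(pred_v, noise_v) + F.mse_loss(pred_a, noise_a)


if __name__ == "__main__":
    model = JointDenoiser()
    v = torch.randn(2, 64, 256)   # (batch, video tokens, dim) -- shapes are illustrative
    a = torch.randn(2, 32, 256)   # (batch, audio tokens, dim)
    print(joint_denoising_loss(model, v, a).item())
```

In this sketch, dropping the audio terms recovers a video-only denoiser with the same video parameters, which mirrors the paper's controlled comparison between a jointly trained T2AV model and a T2V-only counterpart.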