When Vision Speaks for Sound
May 13, 2026
Authors: Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi
cs.AI
Abstract
Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear audio-grounded but in fact exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
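To make the three counterfactual audio edits concrete, the following is a minimal sketch of how Shift, Mute, and Swap could be applied to a mono waveform stored as a NumPy array. It is an illustration only, not the authors' implementation: the function names (`shift_audio`, `mute_audio`, `swap_audio`), the silence-padding behavior, and the choice of a donor clip for Swap are all assumptions, since the abstract does not specify how the edits are produced.

```python
import numpy as np


def shift_audio(audio: np.ndarray, sample_rate: int, offset_s: float) -> np.ndarray:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio.

    The clip length is preserved and the gap is padded with silence
    (an assumption; the abstract does not say how gaps are filled).
    """
    n = int(round(offset_s * sample_rate))
    out = np.zeros_like(audio)
    if n == 0:
        return audio.copy()
    if abs(n) >= len(audio):
        return out  # shifted entirely out of the clip
    if n > 0:
        out[n:] = audio[:-n]
    else:
        out[:n] = audio[-n:]
    return out


def mute_audio(audio: np.ndarray) -> np.ndarray:
    """Mute: replace the track with silence of the same length."""
    return np.zeros_like(audio)


def swap_audio(audio: np.ndarray, donor: np.ndarray) -> np.ndarray:
    """Swap: replace the track with audio from an unrelated donor clip.

    The donor is trimmed or zero-padded to the original length
    (hypothetical choice made here for illustration).
    """
    out = np.zeros_like(audio)
    m = min(len(audio), len(donor))
    out[:m] = donor[:m]
    return out


if __name__ == "__main__":
    sr = 16_000
    t = np.arange(sr) / sr
    original = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 1 s tone
    donor = np.sin(2 * np.pi * 880 * t[: sr // 2]).astype(np.float32)

    edits = {
        "shift": shift_audio(original, sr, 0.25),
        "mute": mute_audio(original),
        "swap": swap_audio(original, donor),
    }
    for name, wav in edits.items():
        print(name, wav.shape, float(np.abs(wav).max()))
```

Under this sketch, each edited track would be paired with the unmodified video, and a model that truly verifies the audio stream should answer differently on the edited and original clips.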