When Vision Speaks for Sound
May 13, 2026
Authors: Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi
cs.AI
Abstract
Despite rapid progress in video-capable multimodal large language models (MLLMs), we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. This issue appears in both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect: models appear to be audio-grounded, but in fact exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To study this behavior systematically, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
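The three counterfactual edits can be illustrated with a minimal sketch. The abstract does not describe how Thud implements them, so everything below is an assumption for illustration: the function names `shift_audio`, `mute_audio`, and `swap_audio` are hypothetical, audio is modeled as a plain mono list of float samples, and silence-padding is one possible way to keep the edited track the same length as the video.

```python
from typing import List


def shift_audio(samples: List[float], sample_rate: int, offset_s: float) -> List[float]:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio.

    The gap is filled with silence so the track keeps its original length
    (an assumption; the abstract does not specify the padding scheme).
    """
    n = min(int(round(abs(offset_s) * sample_rate)), len(samples))
    silence = [0.0] * n
    if offset_s >= 0:
        # Prepend silence and drop the tail that no longer fits.
        return silence + samples[: len(samples) - n]
    # Drop the head and append silence.
    return samples[n:] + silence


def mute_audio(samples: List[float]) -> List[float]:
    """Mute: replace the whole track with silence of the same length."""
    return [0.0] * len(samples)


def swap_audio(samples: List[float], other: List[float]) -> List[float]:
    """Swap: use audio from a different clip, trimmed or zero-padded to fit."""
    padding = [0.0] * max(0, len(samples) - len(other))
    return other[: len(samples)] + padding


if __name__ == "__main__":
    clip = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # toy clip at 4 Hz
    print(shift_audio(clip, sample_rate=4, offset_s=0.5))
    print(mute_audio(clip))
    print(swap_audio(clip, [9.0, 9.0, 9.0]))
```

In each case the video frames are left untouched, so a model that answers an audio question the same way before and after the edit is likely relying on the visual stream rather than on the audio.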