When Vision Speaks for Sound

Abstract

Despite rapid progress in video-capable multimodal large language models (MLLMs), we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear audio-grounded but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
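
The three counterfactual edits can be pictured as simple operations on the audio track while the video frames stay untouched. The sketch below is purely illustrative and is not Thud's actual implementation: the function names, the sample-level offset, the silence padding, and the use of a plain list of floats as a waveform are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for a mono audio track: one float per sample.
Waveform = List[float]


def shift(audio: Waveform, offset: int) -> Waveform:
    """Shift: delay (offset > 0) or advance (offset < 0) the audio.

    Vacated samples are filled with silence so the track keeps the same
    length as the (unchanged) video. Padding choice is an assumption.
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def mute(audio: Waveform) -> Waveform:
    """Mute: remove the sound entirely, keeping the track length."""
    return [0.0] * len(audio)


def swap(audio: Waveform, other: Waveform) -> Waveform:
    """Swap: replace the track with audio from a different clip.

    The replacement is trimmed or zero-padded to the original length.
    """
    n = len(audio)
    return (other + [0.0] * n)[:n]


def make_probes(audio: Waveform, other: Waveform, offset: int) -> Dict[str, Waveform]:
    """Build one edited track per intervention for a single video."""
    return {
        "Shift": shift(audio, offset),
        "Mute": mute(audio),
        "Swap": swap(audio, other),
    }


probes = make_probes([0.1, 0.5, -0.3, 0.2], other=[0.9, 0.9], offset=2)
```

Under these assumptions, a model that truly checks the audio stream should answer differently on the edited track than on the original (for example, reporting no sound after Mute), whereas a model that reads sound off the visuals would give the same answer in both cases.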