When Vision Speaks for Sound

Abstract

Despite rapid progress in video-capable multimodal large language models (MLLMs), we find that their apparent audio understanding in videos is often vision-driven: models infer or hallucinate acoustic information from visual cues rather than verifying the audio stream. The issue appears in both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect: models appear to be audio-grounded but in fact exploit visual-acoustic correlations without checking whether the audio and visual streams are actually aligned. To study this behavior systematically, we introduce Thud, an intervention-driven probing framework built on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we study a two-stage alignment recipe in which intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points while slightly improving performance on general video and audio-visual QA benchmarks.
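To make the three counterfactual audio edits concrete, the following is a minimal sketch of what Shift, Mute, and Swap could look like on a mono waveform stored as a list of samples. The abstract does not specify how the edits are implemented, so the function names (`shift_audio`, `mute_audio`, `swap_audio`), the zero-padding behavior, and the length-matching rule are all illustrative assumptions rather than the Thud implementation.

```python
from typing import List

Wave = List[float]  # mono waveform as a plain list of samples (assumption for illustration)


def shift_audio(wave: Wave, sample_rate: int, offset_s: float) -> Wave:
    """Shift: delay (offset_s > 0) or advance (offset_s < 0) the audio
    relative to the video, filling the gap with silence. Length is preserved."""
    n = min(int(round(abs(offset_s) * sample_rate)), len(wave))
    pad = [0.0] * n
    if offset_s >= 0:
        return pad + wave[: len(wave) - n]
    return wave[n:] + pad


def mute_audio(wave: Wave) -> Wave:
    """Mute: replace the whole track with silence of the same length."""
    return [0.0] * len(wave)


def swap_audio(wave: Wave, donor: Wave) -> Wave:
    """Swap: replace the track with audio from a different (hypothetical donor)
    clip, trimmed or zero-padded so it still matches the video duration."""
    m = min(len(wave), len(donor))
    return donor[:m] + [0.0] * (len(wave) - m)
```

Under these assumptions, each edit leaves the visual stream untouched and changes only the audio, so a model that gives the same answer before and after an edit is presumably not consulting the audio stream. A real pipeline would more likely operate on arrays or on the media container directly.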