When Vision Speaks for Sound

Abstract

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. The issue appears in both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect: models appear to be grounded in audio, but in fact exploit visual-acoustic correlations without checking whether the audio and visual streams are truly aligned. To study this behavior systematically, we introduce Thud, an intervention-driven probing framework built on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we study a two-stage alignment recipe in which intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best recipe, which uses 10K samples, improves average performance across the three intervention dimensions by 28 percentage points while slightly improving performance on general video and audio-visual QA benchmarks.
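To make the three counterfactual edits concrete, the following is a minimal sketch of what Shift, Mute, and Swap could look like on a raw mono waveform. It is an illustration under stated assumptions, not the authors' implementation: the function names, the zero-padding behavior, and the representation of audio as a plain list of samples are all hypothetical, and the abstract does not specify offsets or how the replacement audio for Swap is chosen.

```python
from typing import List

# Assumption: a mono waveform is modeled as a plain list of float samples.
Waveform = List[float]


def shift(audio: Waveform, offset: int) -> Waveform:
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples.

    The output keeps the original length; vacated positions are zero-filled.
    """
    n = len(audio)
    if offset >= 0:
        pad = min(offset, n)
        return [0.0] * pad + audio[: n - pad]
    pad = min(-offset, n)
    return audio[pad:] + [0.0] * pad


def mute(audio: Waveform) -> Waveform:
    """Replace the audio with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio: Waveform, other: Waveform) -> Waveform:
    """Replace the audio with another clip's audio, trimmed or zero-padded
    so that it matches the original length."""
    n = len(audio)
    return other[:n] + [0.0] * max(n - len(other), 0)


if __name__ == "__main__":
    clip = [0.1, 0.2, 0.3, 0.4]
    print(shift(clip, 1))          # audio delayed by one sample
    print(mute(clip))              # silence
    print(swap(clip, [0.9, 0.8]))  # audio from a different clip
```

In each case the video frames are left untouched, so a model that answers audio questions the same way before and after the edit is presumably relying on the visual stream rather than on what it hears.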