When Vision Speaks for Sound

Abstract

Despite rapid progress in video-capable multimodal large language models (MLLMs), we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information rather than verifying the audio stream. This issue appears in both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To study this behavior systematically, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
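To make the three counterfactual audio edits concrete, the following is a minimal sketch of what Shift, Mute, and Swap could look like on a raw mono waveform. The abstract does not give implementation details, so the function names, the sample-level offset, the zero-padding, and the looping of the donor track are all illustrative assumptions rather than the procedure actually used in Thud.

```python
from typing import List

# Mono audio samples; a hypothetical stand-in for a video's real audio track.
Waveform = List[float]


def shift(audio: Waveform, offset: int) -> Waveform:
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples.

    Silence is padded in so the track keeps its original length and only the
    audio-visual timing changes (assumed padding strategy).
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def mute(audio: Waveform) -> Waveform:
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio: Waveform, donor: Waveform) -> Waveform:
    """Replace the track with audio from a different clip.

    The donor is looped or trimmed to the original length (assumed strategy).
    """
    if not donor:
        raise ValueError("donor track must not be empty")
    n = len(audio)
    repeats = -(-n // len(donor))  # ceiling division
    return (donor * repeats)[:n]
```

In each case the visual stream is left untouched, so a model that answers audio questions identically before and after the edit is presumably relying on what it sees rather than on what it hears.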