Do Audio-Visual Large Language Models Really See and Hear?
April 3, 2026
Authors: Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha
cs.AI
Abstract
Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse across the layers of an AVLLM to produce the final text output. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations, suppressing audio cues. We further trace this imbalance to training: the AVLLM's audio behavior closely matches that of its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
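The "probing analyses" mentioned above refer to a standard interpretability technique: training a lightweight classifier on frozen intermediate-layer representations to test what information a layer encodes. The sketch below illustrates the general idea with a per-layer linear probe; it is not the paper's actual pipeline, and the model dimensions, layer count, and random stand-in activations (which in practice would be hidden states cached from an AVLLM) are all illustrative assumptions.

```python
# Minimal layer-wise linear-probe sketch (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for cached activations: hidden[l] has shape (n_examples, d_model),
# one pooled hidden state per example at layer l; y holds audio-event labels.
# In a real study these would come from forward passes through the AVLLM.
n_examples, d_model, n_layers, n_classes = 500, 256, 12, 10
hidden = [rng.normal(size=(n_examples, d_model)) for _ in range(n_layers)]
y = rng.integers(0, n_classes, size=n_examples)

# Fit one linear probe per layer and record held-out accuracy. High probe
# accuracy at intermediate layers combined with wrong final text outputs
# would indicate that audio information is present but suppressed downstream,
# the pattern the abstract describes.
for layer, feats in enumerate(hidden):
    X_tr, X_te, y_tr, y_te = train_test_split(
        feats, y, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy = {probe.score(X_te, y_te):.3f}")
```

With real activations, a rise-then-fall accuracy curve across depth would be consistent with the abstract's claim that audio semantics are encoded mid-network but fail to surface after the deeper fusion layers.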