Do Audio-Visual Large Language Models Really See and Hear?

April 3, 2026
作者: Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha
cs.AI

Abstract

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
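The probing analysis mentioned above can be illustrated with a minimal sketch: a linear (logistic-regression) probe is trained on hidden states from a given layer to predict an audio-derived label, and rising probe accuracy across depth indicates that audio semantics are linearly decodable even if they never surface in the generated text. Everything below is hypothetical, not the paper's code: synthetic Gaussian features stand in for real AVLLM activations, and the layer indices and the injected "signal" strength are illustrative assumptions.

```python
# Sketch of a layer-wise linear probe, assuming per-layer hidden states
# have already been extracted from an AVLLM. Synthetic features stand in
# for real activations; labels are binary audio classes.
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of log-loss w.r.t. w
        b -= lr * float(np.mean(p - y))         # gradient w.r.t. bias
    return w, b

def probe_accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) > 0) == y))

rng = np.random.default_rng(0)
n, d = 400, 16
y = rng.integers(0, 2, size=n)

# Hypothetical "hidden states": the class-dependent shift on one feature
# grows with depth, mimicking audio semantics becoming more linearly
# decodable at intermediate layers.
for layer, signal in [(0, 0.0), (12, 1.5), (24, 3.0)]:
    X = rng.normal(size=(n, d))
    X[:, 0] += signal * (2 * y - 1)  # inject class-dependent separation
    w, b = train_linear_probe(X, y)
    print(f"layer {layer:2d}: probe accuracy = {probe_accuracy(X, y, w, b):.2f}")
```

In a real study the loop would iterate over actual transformer layers (e.g. hidden states returned by the model's forward pass), and the gap between high intermediate-layer probe accuracy and the model's final text output is what evidences the suppression of audio cues described in the abstract.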