

Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

March 6, 2026
Authors: Neta Glazer, Lenny Aharon, Ethan Fetaya
cs.AI

Abstract

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio on the model's output. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to 8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
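The abstract describes two ingredients: a "listening" signal read from the audio-attention mass of a few specialist heads, and a steering direction added to the final representation at inference time. The sketch below illustrates both ideas under simplifying assumptions; the function names, specialist-head indices, steering strength `alpha`, and the use of a mean-difference (audio minus silence) direction are illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch only; tensor shapes and hyperparameters are assumptions,
# not the paper's released code.
import torch

def listening_signal(attn_weights, audio_mask, specialist_heads):
    """Average attention mass that the audio-specialist heads place on audio tokens.

    attn_weights:     (batch, num_heads, q_len, k_len) softmax-normalized attention.
    audio_mask:       (k_len,) bool tensor, True at key positions holding audio tokens.
    specialist_heads: list of head indices identified as audio specialists.
    """
    heads = attn_weights[:, specialist_heads]         # (batch, |S|, q_len, k_len)
    audio_mass = heads[..., audio_mask].sum(dim=-1)   # attention mass on audio keys per query
    return audio_mass.mean()                          # scalar "listening" score

def steer_final_representation(hidden, steering_dir, alpha=4.0):
    """Inference-time activation intervention on the final representation.

    hidden:       (batch, seq_len, d_model) last-layer hidden states.
    steering_dir: (d_model,) audio-minus-silence direction (e.g. difference of mean
                  activations under audio vs. silence inputs -- an assumption here).
    alpha:        steering strength; 4.0 is an arbitrary value for the demo.
    """
    direction = steering_dir / steering_dir.norm()
    return hidden + alpha * direction                 # broadcast over batch and sequence

# Toy usage with random tensors standing in for a real LALM forward pass.
batch, num_heads, q_len, k_len, d_model = 1, 16, 8, 32, 64
attn = torch.softmax(torch.randn(batch, num_heads, q_len, k_len), dim=-1)
audio_mask = torch.zeros(k_len, dtype=torch.bool)
audio_mask[:12] = True                                # pretend the first 12 keys are audio tokens
score = listening_signal(attn, audio_mask, specialist_heads=[2, 5, 11])

hidden = torch.randn(batch, q_len, d_model)
steering_dir = torch.randn(d_model)                   # stand-in for the audio-silence direction
steered = steer_final_representation(hidden, steering_dir)
print(f"listening signal: {score.item():.3f}, steered hidden shape: {tuple(steered.shape)}")
```

In a real model the attention weights and hidden states would come from forward hooks on the decoder rather than random tensors, and the specialist heads would be the ones whose audio-attention mass best tracks whether audio evidence changes the output.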