Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
March 6, 2026
Authors: Neta Glazer, Lenny Aharon, Ethan Fetaya
cs.AI
Abstract
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio–silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio on the model's output. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to 8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
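The "audio–silence steering direction" plus "inference-time activation intervention" described above follows the general difference-of-means activation-steering recipe. A minimal sketch of that recipe, with hypothetical function names and toy random activations standing in for the model's final-layer representations (the paper's exact extraction and scaling details are not specified in the abstract):

```python
import numpy as np

def build_steering_direction(audio_acts, silence_acts):
    # Difference of mean final-layer activations collected under
    # audio vs. silence inputs, normalized to unit length.
    v = audio_acts.mean(axis=0) - silence_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, direction, alpha=4.0):
    # Inference-time intervention: shift a final-token hidden state
    # along the audio-silence direction; alpha sets the strength.
    return hidden + alpha * direction

# Toy demonstration with random "activations" (hidden size 16).
rng = np.random.default_rng(0)
audio_acts = rng.normal(0.5, 1.0, size=(32, 16))    # states on audio inputs
silence_acts = rng.normal(0.0, 1.0, size=(32, 16))  # states on silence inputs
v = build_steering_direction(audio_acts, silence_acts)
h = rng.normal(size=16)            # one final-token representation
h_steered = steer(h, v, alpha=4.0)
```

In a real LALM this would be applied by hooking the final-layer residual stream at generation time (e.g. via a forward hook), which is what makes the intervention parameter-free.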