Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering
March 6, 2026
Authors: Neta Glazer, Lenny Aharon, Ethan Fetaya
cs.AI
Abstract
Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio–silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio on the model's output. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to 8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
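The "audio–silence steering direction" plus "inference-time activation intervention" described above follows the general difference-of-means activation-steering recipe. A minimal sketch of that recipe, with hypothetical function names and toy random activations standing in for the model's final-layer representations (the paper's exact extraction and scaling details are not specified in the abstract):

```python
import numpy as np

def build_steering_direction(audio_acts, silence_acts):
    # Difference of mean final-layer activations collected under
    # audio vs. silence inputs, normalized to unit length.
    v = audio_acts.mean(axis=0) - silence_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, direction, alpha=4.0):
    # Inference-time intervention: shift a final-token hidden state
    # along the audio-silence direction; alpha sets the strength.
    return hidden + alpha * direction

# Toy demonstration with random "activations" (hidden size 16).
rng = np.random.default_rng(0)
audio_acts = rng.normal(0.5, 1.0, size=(32, 16))    # states on audio inputs
silence_acts = rng.normal(0.0, 1.0, size=(32, 16))  # states on silence inputs
v = build_steering_direction(audio_acts, silence_acts)
h = rng.normal(size=16)            # one final-token representation
h_steered = steer(h, v, alpha=4.0)
```

In a real LALM this would be applied by hooking the final-layer residual stream at generation time (e.g. via a forward hook), which is what makes the intervention parameter-free.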