

Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

March 6, 2026
Authors: Neta Glazer, Lenny Aharon, Ethan Fetaya
cs.AI

Abstract

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it contains important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio on the model's output. To demonstrate the utility of this intervention, we show on MMAU that it improves accuracy by up to 8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
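The abstract describes two ingredients: a "listening" signal read from the audio-attention mass of a few specialist heads, and a steering direction added to the final representation at inference time. The sketch below illustrates both ideas under simplifying assumptions; the function names, specialist-head indices, steering strength `alpha`, and the use of a mean-difference (audio minus silence) direction are illustrative stand-ins, not the authors' implementation.

```python
# Illustrative sketch only; tensor shapes and hyperparameters are assumptions,
# not the paper's released code.
import torch

def listening_signal(attn_weights, audio_mask, specialist_heads):
    """Average attention mass that the audio-specialist heads place on audio tokens.

    attn_weights:     (batch, num_heads, q_len, k_len) softmax-normalized attention.
    audio_mask:       (k_len,) bool tensor, True at key positions holding audio tokens.
    specialist_heads: list of head indices identified as audio specialists.
    """
    heads = attn_weights[:, specialist_heads]         # (batch, |S|, q_len, k_len)
    audio_mass = heads[..., audio_mask].sum(dim=-1)   # attention mass on audio keys per query
    return audio_mass.mean()                          # scalar "listening" score

def steer_final_representation(hidden, steering_dir, alpha=4.0):
    """Inference-time activation intervention on the final representation.

    hidden:       (batch, seq_len, d_model) last-layer hidden states.
    steering_dir: (d_model,) audio-minus-silence direction (e.g. difference of mean
                  activations under audio vs. silence inputs -- an assumption here).
    alpha:        steering strength; 4.0 is an arbitrary value for the demo.
    """
    direction = steering_dir / steering_dir.norm()
    return hidden + alpha * direction                 # broadcast over batch and sequence

# Toy usage with random tensors standing in for a real LALM forward pass.
batch, num_heads, q_len, k_len, d_model = 1, 16, 8, 32, 64
attn = torch.softmax(torch.randn(batch, num_heads, q_len, k_len), dim=-1)
audio_mask = torch.zeros(k_len, dtype=torch.bool)
audio_mask[:12] = True                                # pretend the first 12 keys are audio tokens
score = listening_signal(attn, audio_mask, specialist_heads=[2, 5, 11])

hidden = torch.randn(batch, q_len, d_model)
steering_dir = torch.randn(d_model)                   # stand-in for the audio-silence direction
steered = steer_final_representation(hidden, steering_dir)
print(f"listening signal: {score.item():.3f}, steered hidden shape: {tuple(steered.shape)}")
```

In a real model the attention weights and hidden states would come from forward hooks on the decoder rather than random tensors, and the specialist heads would be the ones whose audio-attention mass best tracks whether audio evidence changes the output.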