Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

Abstract

Multimodal large language models can exhibit text dominance: they over-rely on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), where decisive audio evidence can be under-utilized even when it carries important information. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of audio on the model's output. To demonstrate the utility of this intervention, we show that it improves accuracy on MMAU by up to +8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
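The two ingredients named in the abstract, a "listening" signal read from selected attention heads and an audio-silence steering direction added to the final representation, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. All function names, array shapes, the unit-norm scaling of the direction, and the strength parameter `alpha` are assumptions made for illustration. The abstract does not specify how the signal is aggregated or how the direction is scaled.

```python
import numpy as np

def listening_signal(attn, audio_mask, heads):
    """Mean attention mass that the chosen heads place on audio tokens.

    attn:       (num_heads, num_keys) attention weights for one query position
    audio_mask: (num_keys,) boolean mask marking audio token positions
    heads:      indices of the (assumed) audio-specialist heads
    """
    mass_per_head = attn[heads][:, audio_mask].sum(axis=1)
    return mass_per_head.mean()

def steering_direction(h_audio, h_silence):
    """Difference of mean final representations (audio minus silence).

    Normalizing to unit length is an assumption of this sketch.
    """
    d = h_audio.mean(axis=0) - h_silence.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(h, direction, alpha):
    """Shift a final representation along the direction; no weights change."""
    return h + alpha * direction

# Toy data: three heads attending over three key positions, two of them audio.
attn = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.2, 0.7],
                 [1.0, 0.0, 0.0]])
audio_mask = np.array([True, True, False])
signal = listening_signal(attn, audio_mask, heads=[0, 1])

# Toy final representations collected with real audio and with silence.
h_audio = np.array([[2.0, 0.0], [4.0, 0.0]])
h_silence = np.zeros((2, 2))
direction = steering_direction(h_audio, h_silence)
steered = steer(np.array([1.0, 1.0]), direction, alpha=2.0)
```

In a real model the representations would be captured at inference time, for example through forward hooks, and `alpha` could be chosen adaptively from the listening signal. Both of those choices are left open here.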

