Do Audio-Visual Large Language Models Really See and Hear?

Abstract

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse across the layers of an AVLLM to produce the final text output. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the generated text when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations, which tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior closely matches that of its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.
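
The abstract does not specify how the probing analyses are carried out. As a rough illustration of the general idea only, the sketch below fits a linear (ridge-regression) probe on per-layer features and reports held-out accuracy, which is one common way to test whether a label such as an audio class is linearly decodable from a layer. Everything here is an assumption made for illustration: the function names, the synthetic features standing in for hidden states, and the choice of a ridge probe are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes, l2=1e-2):
    """Fit a ridge-regression probe from features to one-hot labels."""
    targets = np.eye(n_classes)[labels]
    # Append a constant column so the probe can learn a bias term.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    gram = design.T @ design + l2 * np.eye(design.shape[1])
    return np.linalg.solve(gram, design.T @ targets)

def probe_accuracy(weights, features, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    predictions = np.argmax(design @ weights, axis=1)
    return float((predictions == labels).mean())

def synthetic_layer(labels, centers, signal, rng):
    """Hypothetical stand-in for one layer's hidden states (not real model data).

    `signal` scales how strongly the class identity is written into the features.
    """
    noise = rng.normal(size=(len(labels), centers.shape[1]))
    return signal * centers[labels] + noise

rng = np.random.default_rng(0)
n_classes, dim = 4, 16
train_labels = rng.integers(0, n_classes, size=400)
test_labels = rng.integers(0, n_classes, size=200)

# Two toy "layers": one carrying no label information, one carrying a lot.
accuracies = {}
for name, signal in [("no_signal", 0.0), ("strong_signal", 3.0)]:
    centers = rng.normal(size=(n_classes, dim))
    train_x = synthetic_layer(train_labels, centers, signal, rng)
    test_x = synthetic_layer(test_labels, centers, signal, rng)
    weights = fit_linear_probe(train_x, train_labels, n_classes)
    accuracies[name] = probe_accuracy(weights, test_x, test_labels)

print(accuracies)
```

In a real study the synthetic features would be replaced by hidden states extracted at each layer of the model, and probe accuracy plotted against layer index would show where audio information is decodable even if it does not reach the generated text.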
