Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Abstract

Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), that is, textual patterns memorized during pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which does not reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that the VIP consistently emerges and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
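To make the idea of contrasting layer-wise representations concrete, the following is a minimal sketch, not the paper's actual procedure. It assumes that each input yields one hidden vector per layer, once with the image and once with text only, and that cosine distance is the representation distance. The function names, the threshold rule for locating a VIP, and the choice to sum distances from the VIP layer onward (inclusive) are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity.

    Assumption: cosine distance is used here only as an example metric.
    By convention in this sketch, a zero-norm vector gives distance 0.0.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def layer_distances(with_image: Sequence[Vector],
                    text_only: Sequence[Vector]) -> List[float]:
    """Per-layer distance between the two chains of embeddings."""
    if len(with_image) != len(text_only):
        raise ValueError("both chains must have the same number of layers")
    return [cosine_distance(a, b) for a, b in zip(with_image, text_only)]


def find_vip(distances: Sequence[float], threshold: float) -> Optional[int]:
    """Hypothetical VIP rule: first layer whose distance exceeds a threshold.

    Returns None when no layer crosses the threshold.
    """
    for layer, distance in enumerate(distances):
        if distance > threshold:
            return layer
    return None


def total_visual_integration(distances: Sequence[float],
                             vip: Optional[int]) -> float:
    """Hypothetical TVI: sum of distances from the VIP layer onward."""
    if vip is None:
        return 0.0
    return sum(distances[vip:])


# Toy 4-layer example with 2-dimensional hidden states (made-up values).
toy_with_image = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_text_only = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
toy_distances = layer_distances(toy_with_image, toy_text_only)
toy_vip = find_vip(toy_distances, threshold=0.5)
toy_tvi = total_visual_integration(toy_distances, toy_vip)
```

In this toy case the two chains are identical for the first two layers and diverge afterwards, so the sketch places the VIP at layer index 2 and accumulates distance only over the last two layers. Under these assumptions, a larger aggregate would indicate that the image changes the representation more, which is the intuition the abstract attaches to a weaker language prior.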
