

Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

September 27, 2025
Authors: Lin Long, Changdae Oh, Seongheon Park, Yixuan Li
cs.AI

Abstract

Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) -- memorized textual patterns from pre-training -- while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges and that TVI reliably predicts the strength of the language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
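
Since the abstract only describes the estimator at a high level, the following is a minimal sketch of the general idea, not the paper's exact method. It assumes that per-layer hidden states are compared between a run on the full image-plus-text query and a run with the image ablated, and that the VIP is found by a simple relative-threshold rule; the distance metric, the threshold, and the names `layerwise_distance`, `visual_integration_point`, and `total_visual_integration` are all illustrative assumptions.

```python
# Illustrative sketch of a VIP/TVI-style computation (assumptions noted above;
# the paper's actual definitions may differ).
import torch

def layerwise_distance(h_vis: list[torch.Tensor], h_txt: list[torch.Tensor]) -> torch.Tensor:
    """L2 distance between mean-pooled hidden states at each layer.

    h_vis / h_txt: one tensor per layer, shape (seq_len, hidden_dim), taken
    from the multimodal run and the vision-ablated run respectively.
    """
    dists = [
        torch.linalg.vector_norm(hv.mean(dim=0) - ht.mean(dim=0))
        for hv, ht in zip(h_vis, h_txt)
    ]
    return torch.stack(dists)  # shape: (num_layers,)

def visual_integration_point(dists: torch.Tensor, rel_threshold: float = 0.1) -> int:
    """First layer whose distance exceeds a fraction of the peak distance --
    an assumed criterion for where vision 'begins to reshape' representations."""
    above = (dists > rel_threshold * dists.max()).nonzero(as_tuple=True)[0]
    return int(above[0]) if len(above) > 0 else len(dists) - 1

def total_visual_integration(dists: torch.Tensor, vip: int) -> float:
    """Aggregate distance beyond the VIP: larger values suggest the response
    draws on visual evidence; smaller values suggest reliance on the language prior."""
    return float(dists[vip:].sum())

# Toy usage with random tensors standing in for a real LVLM's hidden states.
torch.manual_seed(0)
num_layers, seq_len, dim = 32, 16, 64
h_vis = [torch.randn(seq_len, dim) for _ in range(num_layers)]
h_txt = [torch.randn(seq_len, dim) for _ in range(num_layers)]
d = layerwise_distance(h_vis, h_txt)
vip = visual_integration_point(d)
print(f"VIP layer: {vip}, TVI: {total_visual_integration(d, vip):.3f}")
```

In a real setting, the per-layer hidden states would come from an LVLM forward pass with `output_hidden_states=True` (e.g., via the Hugging Face `transformers` API), once with the actual image and once with a blank or ablated image.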