Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Abstract

Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), the textual patterns memorized during pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that a VIP consistently emerges and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
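
To make the idea of aggregating representation distance beyond the VIP concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the distance metric, how the VIP is located, or how the contrast is formed, so everything here is an assumption for illustration: cosine distance between per-layer hidden states obtained with and without the image, a fixed `threshold` that marks the VIP as the first layer whose distance exceeds it, and a plain sum as the aggregation. The function names `layer_distances`, `find_vip`, and `total_visual_integration` are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (assumed metric, not taken from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as "no measurable difference" to avoid dividing by zero.
    return 1.0 - dot / norm if norm > 0 else 0.0


def layer_distances(h_with_image: Sequence[Vector],
                    h_text_only: Sequence[Vector]) -> List[float]:
    """Per-layer distance between the two chains of embeddings."""
    if len(h_with_image) != len(h_text_only):
        raise ValueError("both chains must have the same number of layers")
    return [cosine_distance(a, b) for a, b in zip(h_with_image, h_text_only)]


def find_vip(distances: Sequence[float], threshold: float) -> Optional[int]:
    """First layer whose distance exceeds `threshold` (assumed VIP criterion)."""
    for layer, dist in enumerate(distances):
        if dist > threshold:
            return layer
    return None


def total_visual_integration(distances: Sequence[float],
                             vip: Optional[int]) -> float:
    """Sum the distances from the VIP layer onward; 0.0 if no VIP was found."""
    if vip is None:
        return 0.0
    return float(sum(distances[vip:]))


if __name__ == "__main__":
    # Toy 4-layer example with 2-dimensional hidden states.
    with_image = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    text_only = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    dists = layer_distances(with_image, text_only)
    vip = find_vip(dists, threshold=0.5)
    print(dists, vip, total_visual_integration(dists, vip))
```

Under these assumptions, a small aggregate would suggest that the image barely changes the representations after the VIP, which is consistent with a strong language prior. With a real model, the per-layer vectors would have to come from the hidden states of two forward passes over the same question, one with the image and one without.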
