Selective Training for Large Vision Language Models via Visual Information Gain

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias: they produce answers without relying on visual evidence. Prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, but these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both the sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. By focusing exclusively on visually informative samples and tokens, this approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision.
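The abstract describes VIG only in words, so the following is a minimal sketch of one way such a metric could be computed, not the authors' definition. It assumes that per-token log-probabilities of the reference answer are already available from two forward passes (one with the image, one text-only), and that the gain is taken as the difference of log-perplexities. All function names and the `keep_fraction` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence given natural-log probabilities of its tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def token_vig(lp_with_image: Sequence[float], lp_text_only: Sequence[float]) -> List[float]:
    """Assumed token-level gain: log p(y_t | image, text) - log p(y_t | text)."""
    assert len(lp_with_image) == len(lp_text_only)
    return [a - b for a, b in zip(lp_with_image, lp_text_only)]


def sample_vig(lp_with_image: Sequence[float], lp_text_only: Sequence[float]) -> float:
    """Assumed sample-level gain: log PPL(text only) - log PPL(with image).

    Under this assumption it equals the mean of the token-level gains.
    """
    return math.log(perplexity(lp_text_only)) - math.log(perplexity(lp_with_image))


def select_top_fraction(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Indices of the highest-scoring items, keeping roughly keep_fraction of them."""
    k = max(1, math.ceil(keep_fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    # Toy numbers: the first token (say, a color word) becomes far more
    # likely once the image is visible; the second token is unaffected.
    lp_img = [math.log(0.9), math.log(0.8)]
    lp_txt = [math.log(0.3), math.log(0.8)]
    print(token_vig(lp_img, lp_txt))
    print(sample_vig(lp_img, lp_txt))
    print(select_top_fraction([0.1, 0.9, 0.5, 0.0], keep_fraction=0.5))
```

In this sketch, a token whose probability is unchanged by the image receives a gain of zero, which is one way to read the abstract's claim that VIG highlights visually grounded tokens. Applying `select_top_fraction` to the sample-level or token-level scores would correspond to the selective training idea of keeping only high-VIG supervision.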

