
Selective Training for Large Vision Language Models via Visual Information Gain

February 19, 2026
Authors: Seulbi Lee, Sangheum Hwang
cs.AI

Abstract

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior works attempt to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both the sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. By focusing exclusively on visually informative samples and tokens, this approach improves visual grounding, mitigates language bias, and achieves superior performance with significantly reduced supervision.
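
The abstract does not spell out the exact formula, but a natural reading is that VIG is the per-token log-likelihood gain (equivalently, the perplexity reduction) obtained by conditioning the model on the image. The sketch below illustrates that reading only; the function names, the toy numbers, and the TOKEN_THRESHOLD hyperparameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the VIG idea as described in the abstract (not the
# authors' released code). Assumes VIG is the per-token log-likelihood
# gain from conditioning on the image: logp(with image) - logp(text only).
import numpy as np

def token_vig(logp_with_image: np.ndarray, logp_text_only: np.ndarray) -> np.ndarray:
    """Token-level VIG: how much the image lowers each token's surprisal.

    Both inputs are per-token log-probabilities of the same answer tokens,
    scored by the LVLM with and without the visual input.
    """
    return logp_with_image - logp_text_only  # > 0 means the image helped

def sample_vig(logp_with_image: np.ndarray, logp_text_only: np.ndarray) -> float:
    """Sample-level VIG: log of the perplexity ratio (text-only / with-image)."""
    return float(np.mean(logp_with_image - logp_text_only))

# Toy usage: answer tokens for "the red ball" -- 'red' gains most from the image.
logp_img  = np.log(np.array([0.60, 0.55, 0.70]))
logp_text = np.log(np.array([0.58, 0.10, 0.65]))
vig = token_vig(logp_img, logp_text)

# VIG-guided selection: keep only visually informative tokens for training.
TOKEN_THRESHOLD = 0.5              # assumed hyperparameter, not from the paper
train_mask = vig > TOKEN_THRESHOLD # mask the training loss to high-VIG tokens
print(vig.round(2), train_mask)
```

Masking the training loss to high-VIG tokens, and analogously ranking samples by their mean VIG, is one plausible way to realize the selective training the abstract describes: supervision shrinks to the subset of content that demonstrably depends on the image.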