Selective Training for Large Vision Language Models via Visual Information Gain
February 19, 2026
Authors: Seulbi Lee, Sangheum Hwang
cs.AI
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both the sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. By focusing exclusively on visually informative samples and tokens, this approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision.
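The abstract does not spell out the exact formula, but a natural reading of "reduction in prediction uncertainty provided by visual input" is the drop in per-token negative log-likelihood when the answer is scored with the image versus without it. The sketch below illustrates that interpretation on toy probabilities; the function names (`token_vig`, `sample_vig`) and the probability values are illustrative assumptions, not the paper's implementation.

```python
import math

def token_vig(p_with_image, p_without_image):
    """Per-token Visual Information Gain (assumed form):
    surprisal without the image minus surprisal with it,
    i.e. log P(token | prompt, image) - log P(token | prompt)."""
    return [math.log(pw) - math.log(pwo)
            for pw, pwo in zip(p_with_image, p_without_image)]

def sample_vig(p_with_image, p_without_image):
    """Sample-level VIG: mean per-token gain, which equals the
    reduction in log-perplexity from conditioning on the image."""
    gains = token_vig(p_with_image, p_without_image)
    return sum(gains) / len(gains)

# Hypothetical answer tokens "the red cube": the color token gains
# the most probability mass once the model can see the image.
p_img = [0.9, 0.6, 0.8]   # P(token | prompt, image)
p_txt = [0.9, 0.1, 0.7]   # P(token | prompt only)

per_token = token_vig(p_img, p_txt)   # "red" dominates the gain
overall = sample_vig(p_img, p_txt)    # positive: image helps overall
```

Under this reading, token-level VIG singles out visually grounded words ("red" here), while sample-level VIG ranks whole training examples for the selective-training scheme.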