ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
June 11, 2025
Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
cs.AI
Abstract
Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from a 200-word caption, we inject a single, subtle visual
description error (altering a few words describing objects, attributes, counts,
or spatial relations) and task the model to pinpoint the corrupted span given the
image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained with the ViCrit task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
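The abstract describes ViCrit's reward as a binary, exact-match check on the corrupted span the model points to. The sketch below is only an illustration of that idea, not the authors' released code; the function name, the normalization step, and the example caption are hypothetical.

```python
def vicrit_reward(predicted_span: str, injected_span: str) -> float:
    """Illustrative binary exact-match reward for the ViCrit proxy task.

    The model must pinpoint the words that were injected into the caption;
    it earns 1.0 only if its prediction matches the injected span exactly
    (after simple whitespace/case normalization, an assumption here), else 0.0.
    """
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(predicted_span) == normalize(injected_span) else 0.0


# Hypothetical usage: the caption fragment "two dogs on a red sofa" was
# corrupted to "two cats on a red sofa", so the injected span is "cats".
assert vicrit_reward("cats", "cats") == 1.0
assert vicrit_reward("dogs", "cats") == 0.0
```

Because the reward is a single exact-match bit, it is cheap to compute at scale and leaves no room for ambiguous partial credit, which is what makes the task suitable for RL fine-tuning.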