ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
June 11, 2025
Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
cs.AI
Abstract
Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from a 200-word captions, we inject a single, subtle visual
description error-altering a few words on objects, attributes, counts, or
spatial relations-and task the model to pinpoint the corrupted span given the
image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained on the ViCrit task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
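
To make the verifiable objective concrete, the sketch below illustrates, in Python, how a single caption span might be corrupted and how a binary exact-match reward could be computed. The helper names (`inject_hallucination`, `exact_match_reward`) are hypothetical and not taken from the paper's code; this is a minimal illustration of the idea, not the authors' implementation.

```python
# Minimal sketch of a ViCrit-style proxy task (hypothetical helper names;
# the paper's actual data pipeline and reward code are not shown here).

def inject_hallucination(caption: str, target_span: str, altered_span: str) -> str:
    """Replace one short span of the caption with an altered, incorrect span,
    e.g. changing an object, attribute, count, or spatial relation."""
    assert target_span in caption, "span to corrupt must appear in the caption"
    return caption.replace(target_span, altered_span, 1)


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so the match is not trivially brittle."""
    return " ".join(s.lower().split())


def exact_match_reward(predicted_span: str, altered_span: str) -> float:
    """Binary reward: 1.0 iff the model pinpoints the corrupted span exactly."""
    return 1.0 if _normalize(predicted_span) == _normalize(altered_span) else 0.0


if __name__ == "__main__":
    caption = "A red kite flies above two children standing on the beach."
    corrupted = inject_hallucination(caption, "two children", "three children")
    # A VLM policy would read the image plus `corrupted` and output a span;
    # here we only show how the verifiable reward is scored.
    print(exact_match_reward("three children", "three children"))  # 1.0
    print(exact_match_reward("red kite", "three children"))        # 0.0
```

Because the reward reduces to an exact string comparison against the injected span, it is cheap to compute at RL scale and leaves no ambiguity about whether a rollout succeeded.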