
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

June 11, 2025
Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
cs.AI

Abstract

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words describing objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
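
To make the training signal concrete, here is a minimal sketch of how the error injection and binary exact-match reward described in the abstract could be computed. The function names (`inject_error`, `vicrit_reward`) and the example caption are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a ViCrit-style data construction and reward, assuming the
# model is asked to output the corrupted span verbatim. All names here are
# hypothetical illustrations, not the paper's official code.

def inject_error(caption: str, original_span: str, corrupted_span: str) -> str:
    """Replace one faithful description span with a subtle visual error."""
    return caption.replace(original_span, corrupted_span, 1)

def vicrit_reward(predicted_span: str, corrupted_span: str) -> int:
    """Binary exact-match reward: 1 iff the model pinpoints the injected span."""
    return int(predicted_span.strip() == corrupted_span.strip())

# Example: injecting a counting error into a human-written caption.
caption = "Two dogs are playing on the grass beside a red bench."
modified = inject_error(caption, "Two dogs", "Three dogs")
# modified == "Three dogs are playing on the grass beside a red bench."

print(vicrit_reward("Three dogs", "Three dogs"))  # 1: correct localization
print(vicrit_reward("red bench", "Three dogs"))   # 0: wrong span, no reward
```

Because the reward is a strict string match against a known injected span, it stays unambiguous and cheap to verify, which is the property that makes the task suitable as an RL objective.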