ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs
June 11, 2025
Authors: Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang
cs.AI
Abstract
Reinforcement learning (RL) has shown great effectiveness for fine-tuning
large language models (LLMs) using tasks that are challenging yet easily
verifiable, such as math reasoning or code generation. However, extending this
success to visual perception in vision-language models (VLMs) has been impeded
by the scarcity of vision-centric tasks that are simultaneously challenging and
unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption
Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle,
synthetic visual hallucination injected into paragraphs of human-written image
captions. Starting from a 200-word captions, we inject a single, subtle visual
description error-altering a few words on objects, attributes, counts, or
spatial relations-and task the model to pinpoint the corrupted span given the
image and the modified caption. This formulation preserves the full perceptual
difficulty while providing a binary, exact-match reward that is easy to compute
and unambiguous. Models trained on the ViCrit task exhibit substantial gains
across a variety of VL benchmarks. Crucially, the improvements transfer beyond
natural-image training data to abstract image reasoning and visual math,
showing promise of learning to perceive rather than merely memorizing seen
objects. To facilitate evaluation, we further introduce ViCrit-Bench, a
category-balanced diagnostic benchmark that systematically probes perception
errors across diverse image domains and error types. Together, our results
demonstrate that fine-grained hallucination criticism is an effective and
generalizable objective for enhancing visual perception in VLMs.
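
To make the verifiable objective concrete, the sketch below illustrates, in Python, how a single caption span might be corrupted and how a binary exact-match reward could be computed. The helper names (`inject_hallucination`, `exact_match_reward`) are hypothetical and not taken from the paper's code; this is a minimal illustration of the idea, not the authors' implementation.

```python
# Minimal sketch of a ViCrit-style proxy task (hypothetical helper names;
# the paper's actual data pipeline and reward code are not shown here).

def inject_hallucination(caption: str, target_span: str, altered_span: str) -> str:
    """Replace one short span of the caption with an altered, incorrect span,
    e.g. changing an object, attribute, count, or spatial relation."""
    assert target_span in caption, "span to corrupt must appear in the caption"
    return caption.replace(target_span, altered_span, 1)


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so the match is not trivially brittle."""
    return " ".join(s.lower().split())


def exact_match_reward(predicted_span: str, altered_span: str) -> float:
    """Binary reward: 1.0 iff the model pinpoints the corrupted span exactly."""
    return 1.0 if _normalize(predicted_span) == _normalize(altered_span) else 0.0


if __name__ == "__main__":
    caption = "A red kite flies above two children standing on the beach."
    corrupted = inject_hallucination(caption, "two children", "three children")
    # A VLM policy would read the image plus `corrupted` and output a span;
    # here we only show how the verifiable reward is scored.
    print(exact_match_reward("three children", "three children"))  # 1.0
    print(exact_match_reward("red kite", "three children"))        # 0.0
```

Because the reward reduces to an exact string comparison against the injected span, it is cheap to compute at RL scale and leaves no ambiguity about whether a rollout succeeded.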