ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Abstract

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from a 200-word caption, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
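To make the task formulation concrete, the following is a minimal sketch of error injection and a binary exact-match reward. It is an illustration under stated assumptions, not the authors' implementation: the function names (`inject_error`, `normalize`, `exact_match_reward`), the toy caption, and the normalization step (lowercasing, trimming punctuation) are all hypothetical and are not specified in the abstract.

```python
import re


def normalize(span: str) -> str:
    """Lowercase, collapse whitespace, and trim surrounding punctuation.

    Assumption: the abstract only says "exact-match"; this light
    normalization is a hypothetical choice made for the sketch.
    """
    span = re.sub(r"\s+", " ", span.lower().strip())
    return span.strip(" .,;:!?\"'")


def inject_error(caption: str, original: str, replacement: str) -> str:
    """Corrupt a caption by swapping one phrase for a wrong one (first match only)."""
    if original not in caption:
        raise ValueError("original phrase not found in caption")
    return caption.replace(original, replacement, 1)


def exact_match_reward(predicted_span: str, corrupted_span: str) -> float:
    """Binary reward: 1.0 if the predicted span matches the injected span, else 0.0."""
    return 1.0 if normalize(predicted_span) == normalize(corrupted_span) else 0.0


# Toy example (hypothetical caption, far shorter than the 200-word captions described).
caption = "Two brown dogs sit on a red couch next to a window."
corrupted = inject_error(caption, "red couch", "blue couch")

# A model answer naming the injected span earns the reward; any other span earns none.
reward_hit = exact_match_reward("Blue couch.", "blue couch")
reward_miss = exact_match_reward("brown dogs", "blue couch")
```

Because the reward depends only on whether the predicted span matches the injected one, it can be computed without a learned judge, which is the property the abstract highlights as making the task verifiable.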
