Grounded Reinforcement Learning for Visual Reasoning
May 29, 2025
Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki
cs.AI
Abstract
While reinforcement learning (RL) over chains of thought has significantly
advanced language models in tasks such as mathematics and coding, visual
reasoning introduces added complexity by requiring models to direct visual
attention, interpret perceptual inputs, and ground abstract reasoning in
spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement
Learning), a vision-language model trained with RL to explicitly anchor each
reasoning step to specific visual coordinates. Inspired by human visual
decision-making, ViGoRL learns to produce spatially grounded reasoning traces,
guiding visual attention to task-relevant regions at each step. When
fine-grained exploration is required, our novel multi-turn RL framework enables
the model to dynamically zoom into predicted coordinates as reasoning unfolds.
Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK
for spatial reasoning, V*Bench for visual search, and ScreenSpot and
VisualWebArena for web-based grounding--ViGoRL consistently outperforms both
supervised fine-tuning and conventional RL baselines that lack explicit
grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual
feedback significantly improves ViGoRL's performance on localizing small GUI
elements and visual search, achieving 86.4% on V*Bench. Additionally, we find
that grounding amplifies other visual behaviors such as region exploration,
grounded subgoal setting, and visual verification. Finally, human evaluations
show that the model's visual references are not only spatially accurate but
also helpful for understanding model reasoning steps. Our results show that
visually grounded RL is a strong paradigm for imbuing models with
general-purpose visual reasoning.
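
To make the multi-turn "ground, then zoom" loop described in the abstract concrete, the sketch below shows one plausible way such an inference loop could be wired together: the model emits reasoning steps anchored to (x, y) coordinates, and when a step points at a region, the framework crops and magnifies that region as zoomed-in visual feedback for the next turn. This is only an illustration under stated assumptions; the model interface (generate_step), the coordinate-tagged output format, and the crop/zoom parameters (CROP_SIZE, ZOOM, MAX_TURNS) are hypothetical and not taken from the paper's released code.

```python
# Hypothetical sketch of a multi-turn grounded reasoning loop with zoom-in
# feedback. The model interface and output format are assumptions made for
# illustration, not ViGoRL's actual API.
import re
from PIL import Image

ZOOM = 3          # magnification applied around a predicted point (assumed)
CROP_SIZE = 224   # side length of the window cropped around the point (assumed)
MAX_TURNS = 4     # cap on the number of zoom-in turns (assumed)

def crop_and_zoom(image: Image.Image, x: int, y: int) -> Image.Image:
    """Crop a window centered on (x, y) and enlarge it for the next turn."""
    half = CROP_SIZE // 2
    left = max(0, min(x - half, image.width - CROP_SIZE))
    top = max(0, min(y - half, image.height - CROP_SIZE))
    crop = image.crop((left, top, left + CROP_SIZE, top + CROP_SIZE))
    return crop.resize((CROP_SIZE * ZOOM, CROP_SIZE * ZOOM))

def grounded_multi_turn(model, image: Image.Image, question: str) -> str:
    """Run reasoning turns in which each step is anchored to (x, y) coordinates."""
    context, view = [question], image
    for _ in range(MAX_TURNS):
        step = model.generate_step(view, context)       # hypothetical VLM call
        context.append(step)
        match = re.search(r"\((\d+),\s*(\d+)\)", step)  # parse the grounded point
        if "ANSWER:" in step or match is None:
            break
        x, y = int(match.group(1)), int(match.group(2))
        view = crop_and_zoom(image, x, y)               # zoomed-in visual feedback
    return context[-1]
```

In this sketch the zoomed view is always re-cropped from the original full-resolution image and the number of turns is bounded, which keeps the loop simple; in the paper the model itself is trained with RL to produce such coordinate-grounded traces and to decide when fine-grained zooming is needed, rather than following a hand-written inference loop like this one.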