Reinforced Visual Perception with Tools

Abstract

Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these models for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but this approach faces key limitations: expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT, which uses reinforcement learning to enhance multimodal LLMs' abilities to reason about and use visual tools. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench, respectively. Finally, through extensive ablations, we bring to the community new insights on RL-based visual tool use. Our code is available at https://github.com/ls-kelvin/REVPT.
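
For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage that standard GRPO is built on: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its group. This is only a minimal illustration of generic GRPO, not the ReVPT algorithm itself; the function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: `eps` and the population standard deviation are
    illustrative choices, not details taken from the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get positive advantages, those below negative.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one visual question
# (for example, 1.0 for a correct final answer and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is computed from the group itself, no separate value model is needed. A group in which every response receives the same reward yields zero advantages and therefore contributes no learning signal.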
