
Reinforced Visual Perception with Tools

September 1, 2025
作者: Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, Ranjay Krishna
cs.AI

Abstract
Visual reasoning, a cornerstone of human intelligence, encompasses the complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
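The abstract states that the RL recipe builds on GRPO. As a minimal sketch of the group-relative advantage step that characterizes GRPO (the function name, group size, and reward values below are illustrative, not taken from the paper): each prompt is sampled several times, and each rollout's reward is normalized against the statistics of its own group, removing the need for a learned value critic.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: normalize each sampled
    rollout's scalar reward by its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative example: binary task rewards for 4 rollouts of one prompt.
# Correct answers receive positive advantage, incorrect ones negative.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In a tool-use setting like the one the abstract describes, the reward for each rollout would also reflect whether the model invoked its visual tools productively; the exact reward design is specific to the paper.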