
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

November 24, 2025
作者: Meng Lu, Ran Xu, Yi Fang, Wenxuan Zhang, Yue Yu, Gaurav Srivastava, Yuchen Zhuang, Mohamed Elhoseiny, Charles Fleming, Carl Yang, Zhengzhong Tu, Yang Xie, Guanghua Xiao, Hanrui Wang, Di Jin, Wenqi Shi, Xuan Wang
cs.AI

Abstract

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. Although recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines of similar size by 9.51%-18.72%, demonstrating that VISTA-Gym is an effective training ground for unlocking tool-integrated reasoning capabilities in VLMs.
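
To make the abstract's description concrete, below is a minimal sketch of what a tool-integrated interaction loop with verifiable feedback and trajectory logging could look like. It is not the paper's actual VISTA-Gym API; every name here (Step, Trajectory, run_episode, the `<tool>`/`<answer>` tag format, the exact-match reward) is a hypothetical illustration of the general pattern the abstract describes: a policy that interleaves reasoning with visual tool calls (e.g., grounding, parsing), an executable loop that returns tool outputs, and a verifiable reward computed at the end of each rollout.

```python
# Hypothetical sketch of a multi-turn, tool-integrated rollout loop in the
# spirit of the VISTA-Gym description above. Names and tag formats are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One turn of the trajectory: model output plus optional tool feedback."""
    model_output: str
    tool_name: Optional[str] = None
    tool_result: Optional[str] = None


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # verifiable reward, e.g., exact match on the final answer


def run_episode(
    policy: Callable[[str, List[Step]], str],
    tools: Dict[str, Callable[[str], str]],  # e.g., {"ground": ..., "parse": ...}
    question: str,
    gold_answer: str,
    max_turns: int = 8,
) -> Trajectory:
    """Roll out one multi-turn trajectory: the policy interleaves free-form
    reasoning with tool calls of the form <tool>name(args)</tool> until it
    emits a final answer <answer>...</answer> or exhausts the turn budget."""
    traj = Trajectory()
    for _ in range(max_turns):
        output = policy(question, traj.steps)
        if "<answer>" in output and "</answer>" in output:
            answer = output.split("<answer>")[1].split("</answer>")[0].strip()
            traj.steps.append(Step(model_output=output))
            traj.reward = 1.0 if answer == gold_answer else 0.0  # verifiable signal
            return traj
        if "<tool>" in output and "</tool>" in output:
            call = output.split("<tool>")[1].split("</tool>")[0]
            name, _, args = call.partition("(")
            result = tools.get(name, lambda a: f"unknown tool: {name}")(args.rstrip(")"))
            traj.steps.append(Step(model_output=output, tool_name=name, tool_result=result))
        else:
            traj.steps.append(Step(model_output=output))
    return traj  # no answer within the budget: reward stays 0.0
```

Trajectories collected this way (interleaved reasoning, tool calls, tool results, and a terminal verifiable reward) are exactly the kind of multi-turn samples that end-to-end reinforcement learning, as described for VISTA-R1, would consume.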