
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

September 1, 2025
Authors: Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving a near-2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
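The abstract's claim that adding a tool requires "only lightweight Python definitions" can be illustrated with a minimal registry-plus-decorator sketch. This is a hypothetical illustration of the plugin pattern, not the actual VerlTool API: the names `ToolResult`, `register_tool`, and `run_tool`, and the toy `python_eval` tool, are all invented here for clarity.

```python
# Hypothetical sketch of a plugin-style tool registry (NOT the real
# VerlTool interface): each tool is a plain Python function registered
# under a name; the rollout loop calls it between LLM turns and feeds
# its observation back into the trajectory.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToolResult:
    observation: str   # text returned to the model as observation tokens
    done: bool = False # whether the trajectory should terminate


TOOL_REGISTRY: Dict[str, Callable[[str], ToolResult]] = {}


def register_tool(name: str):
    """Decorator: defining a tool is just defining one function."""
    def wrap(fn: Callable[[str], ToolResult]) -> Callable[[str], ToolResult]:
        TOOL_REGISTRY[name] = fn
        return fn
    return wrap


@register_tool("python_eval")
def python_eval(code: str) -> ToolResult:
    # Toy "code execution" tool: evaluate an arithmetic expression with
    # builtins stripped; a real system would use a sandboxed executor.
    try:
        return ToolResult(observation=str(eval(code, {"__builtins__": {}})))
    except Exception as exc:
        return ToolResult(observation=f"Error: {exc}")


def run_tool(name: str, args: str) -> ToolResult:
    """Dispatch a tool call parsed out of the model's generation."""
    return TOOL_REGISTRY[name](args)
```

Under this pattern, extending the framework to a new modality (say, SQL) would mean registering one more function, which matches the paper's stated goal of reducing per-tool development overhead.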