VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

September 1, 2025
Authors: Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving a near-2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
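
To make the multi-turn formalization mentioned in the abstract concrete, one natural way to write it down is sketched below. The notation is ours, not quoted from the paper; consult the paper itself for the exact objective.

```latex
% Illustrative notation (an assumption, not the paper's exact formalism).
% Single-turn RLVR scores one completion y for a prompt x with a
% verifiable reward r(x, y). An ARLT trajectory instead interleaves
% policy actions a_t with tool-produced observations o_t, which may be
% text, image, or video tokens:
\[
\tau = \bigl(x,\, a_1,\, o_1,\, a_2,\, o_2,\, \ldots,\, a_T\bigr),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\bigl[R(\tau)\bigr],
\]
% where the o_t are emitted by external tools (not by the policy
% \pi_\theta) and R(\tau) is a verifiable outcome reward computed on
% the full trajectory.
```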
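As a rough illustration of what a "lightweight Python definition" for a tool plugin could look like, the sketch below assumes a hypothetical `BaseTool` interface, `ToolResult` container, and `TOOL_REGISTRY`; these names are invented for this example and are not VerlTool's actual API (see the linked repository for the real interface).

```python
import contextlib
import io
import re
from dataclasses import dataclass


@dataclass
class ToolResult:
    """What a tool hands back to the rollout loop (hypothetical)."""
    observation: str      # appended to the context as the next-turn observation
    done: bool = False    # True if the tool decides the trajectory should stop


class BaseTool:
    """Hypothetical plugin contract: one class, one `execute` method."""
    name: str = "base"

    def execute(self, action: str) -> ToolResult:
        raise NotImplementedError


TOOL_REGISTRY: dict[str, BaseTool] = {}


def register_tool(tool: BaseTool) -> None:
    """Registering an instance is all a new plugin would need to do."""
    TOOL_REGISTRY[tool.name] = tool


class PythonExecutor(BaseTool):
    """Executes a fenced code block from the model's action, returns stdout."""
    name = "python"

    _FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

    def execute(self, action: str) -> ToolResult:
        match = self._FENCE.search(action)
        code = match.group(1) if match else action
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {})  # NOTE: no sandboxing in this sketch
            return ToolResult(observation=buf.getvalue())
        except Exception as exc:  # surface errors as observations, not crashes
            return ToolResult(observation=f"Error: {exc}")


register_tool(PythonExecutor())

if __name__ == "__main__":
    result = TOOL_REGISTRY["python"].execute("```python\nprint(1 + 1)\n```")
    print(result.observation)  # -> 2
```

The point of the sketch is the shape rather than the specifics: adding a tool amounts to defining one class and registering it, so the training loop itself never needs task-specific changes.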