VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
September 1, 2025
Authors: Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated
success in enhancing LLM reasoning capabilities, but remains limited to
single-turn interactions without tool integration. While recent Agentic
Reinforcement Learning with Tool use (ARLT) approaches have emerged to address
multi-turn tool interactions, existing works develop task-specific codebases
that suffer from fragmentation, synchronous execution bottlenecks, and limited
extensibility across domains. These inefficiencies hinder broader community
adoption and algorithmic innovation. We introduce VerlTool, a unified and
modular framework that addresses these limitations through systematic design
principles. VerlTool provides four key contributions: (1) upstream alignment
with VeRL ensuring compatibility and simplified maintenance, (2) unified tool
management via standardized APIs supporting diverse modalities including code
execution, search, SQL databases, and vision processing, (3) asynchronous
rollout execution achieving a nearly 2× speedup by eliminating
synchronization bottlenecks, and (4) comprehensive evaluation demonstrating
competitive performance across 6 ARLT domains. Our framework formalizes ARLT as
multi-turn trajectories with multi-modal observation tokens (text/image/video),
extending beyond single-turn RLVR paradigms. We train and evaluate models on
mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web
search, and software engineering tasks, achieving results comparable to
specialized systems while providing unified training infrastructure. The
modular plugin architecture enables rapid tool integration requiring only
lightweight Python definitions, significantly reducing development overhead and
providing a scalable foundation for tool-augmented RL research. Our code is
open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
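The abstract states that the plugin architecture lets a new tool be integrated with only a lightweight Python definition. The minimal sketch below illustrates what such a definition could look like; the class and method names (`CalculatorTool`, `parse_action`, `execute`, `ToolResult`) are illustrative assumptions, not VerlTool's actual API.

```python
# Hypothetical tool-plugin sketch. All names here are assumptions chosen to
# illustrate the pattern; consult the verl-tool repository for the real API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    observation: str   # text fed back into the multi-turn trajectory
    done: bool = False # whether the episode should terminate


class CalculatorTool:
    """Toy tool: evaluates an arithmetic expression found in model output."""

    name = "calculator"

    def parse_action(self, model_output: str) -> Optional[str]:
        # Extract the tool call between <calc> ... </calc> tags, if present.
        start = model_output.find("<calc>")
        end = model_output.find("</calc>")
        if start == -1 or end == -1:
            return None
        return model_output[start + len("<calc>"):end].strip()

    def execute(self, action: str) -> ToolResult:
        try:
            # Restricted eval for the toy example only; a real code-execution
            # tool would run in a proper sandbox.
            value = eval(action, {"__builtins__": {}}, {})
            return ToolResult(observation=str(value))
        except Exception as exc:
            return ToolResult(observation=f"error: {exc}")


# Usage: a rollout loop would call parse_action on each model turn and append
# the returned observation tokens before the next generation step.
tool = CalculatorTool()
action = tool.parse_action("Let me compute. <calc>2 * (3 + 4)</calc>")
result = tool.execute(action)
print(result.observation)  # → "14"
```

Keeping the plugin to two hooks (parse the model's action, execute it and return an observation) is what allows the trainer to stay tool-agnostic: the rollout engine only sees strings in and observation tokens out, regardless of whether the tool wraps a code sandbox, a search API, or a SQL database.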