VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in enhancing the reasoning capabilities of LLMs, but it remains limited to single-turn interactions without tool integration. Recent Agentic Reinforcement Learning with Tool use (ARLT) approaches address multi-turn tool interactions, but existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL, ensuring compatibility and simplified maintenance; (2) unified tool management via standardized APIs supporting diverse modalities, including code execution, search, SQL databases, and vision processing; (3) asynchronous rollout execution that achieves a nearly 2× speedup by eliminating synchronization bottlenecks; and (4) a comprehensive evaluation demonstrating competitive performance across six ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing a unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
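To make the asynchronous rollout idea concrete, here is a minimal sketch, not VerlTool's actual code or API. It assumes a hypothetical tool registry (`TOOLS`, `register_tool`), two stand-in tools whose latency is simulated with `asyncio.sleep`, and fixed "plans" in place of policy outputs. It contrasts trajectories that each await only their own tool call with a baseline that waits for the slowest tool call in the batch at every turn.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical registry mapping a tool name to an async callable.
# These names are illustrative only and are not VerlTool's actual API.
TOOLS: Dict[str, Callable[[str], Awaitable[str]]] = {}


def register_tool(name: str):
    """Decorator that adds an async tool function to the registry."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("python")
async def run_python(action: str) -> str:
    await asyncio.sleep(0.001 * len(action))  # stand-in for sandbox latency
    return f"executed:{action}"


@register_tool("search")
async def run_search(action: str) -> str:
    await asyncio.sleep(0.001 * len(action))  # stand-in for network latency
    return f"results:{action}"


# A plan is a fixed list of (tool, action) pairs standing in for policy outputs.
Plan = List[Tuple[str, str]]


async def rollout(plan: Plan) -> List[str]:
    """Run one multi-turn trajectory; each turn awaits only its own tool call."""
    observations = []
    for tool_name, action in plan:
        observations.append(await TOOLS[tool_name](action))
    return observations


async def rollout_async(plans: List[Plan]) -> List[List[str]]:
    """Trajectories progress independently, with no per-turn batch barrier."""
    return list(await asyncio.gather(*(rollout(p) for p in plans)))


async def rollout_sync_barrier(plans: List[Plan]) -> List[List[str]]:
    """Baseline: every turn waits for the slowest tool call in the batch."""
    results: List[List[str]] = [[] for _ in plans]
    for turn in range(max(len(p) for p in plans)):
        active = [(i, p[turn]) for i, p in enumerate(plans) if turn < len(p)]
        outs = await asyncio.gather(*(TOOLS[t](a) for _, (t, a) in active))
        for (i, _), obs in zip(active, outs):
            results[i].append(obs)
    return results
```

Both functions return the same observations; they differ only in scheduling. With the barrier, total time is the sum over turns of the slowest call in the batch, whereas the independent version is bounded by the slowest single trajectory. How much this saves in practice depends on how unevenly tool latencies are distributed.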
