VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. Recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, but existing work relies on task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL, ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving a nearly 2× speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
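
To make the ideas of plugin-style tool definitions and asynchronous rollouts concrete, here is a minimal Python sketch. It is not VerlTool's actual API: the names `TOOLS`, `register_tool`, `run_python`, `run_search`, `rollout`, and `rollout_batch` are hypothetical, the policy is replaced by a string stub, and tool latency is simulated with short sleeps. It only illustrates how each trajectory can await its own tool calls instead of waiting at a per-turn barrier for the whole batch.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical registry mapping a tool name to an async handler.
# None of these names come from VerlTool; they are illustrative only.
TOOLS: Dict[str, Callable[[str], Awaitable[str]]] = {}


def register_tool(name: str):
    """Decorator that adds an async tool handler to the registry."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@register_tool("python")
async def run_python(action: str) -> str:
    await asyncio.sleep(0.02)  # stand-in for a slow sandboxed execution
    return f"executed:{action}"


@register_tool("search")
async def run_search(action: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a retrieval request
    return f"results:{action}"


async def rollout(prompt: str, tool_name: str, max_turns: int = 2) -> List[str]:
    """Build one multi-turn trajectory of alternating actions and observations."""
    trajectory = [prompt]
    for turn in range(max_turns):
        action = f"{prompt}-a{turn}"  # stand-in for text sampled from the policy
        observation = await TOOLS[tool_name](action)
        trajectory += [action, observation]
    return trajectory


async def rollout_batch(requests: List[Tuple[str, str]]) -> List[List[str]]:
    # Each trajectory awaits only its own tool calls, so a slow tool in one
    # trajectory does not stall the others at every turn.
    return await asyncio.gather(*(rollout(p, t) for p, t in requests))


trajectories = asyncio.run(rollout_batch([("q1", "python"), ("q2", "search")]))
```

In this sketch, adding a tool amounts to writing one decorated coroutine, and `asyncio.gather` returns the trajectories in the same order as the requests even though they finish at different times.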
