WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

Abstract

While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, guided entirely by binary rewards that depend on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, which raises the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and of Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and of test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights into incorporating long chain-of-thought (CoT) reasoning in web agents.
