WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
May 22, 2025
Authors: Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
cs.AI
Abstract
While reinforcement learning (RL) has demonstrated remarkable success in
enhancing large language models (LLMs), it has primarily focused on single-turn
tasks such as solving math problems. Training effective web agents for
multi-turn interactions remains challenging due to the complexity of
long-horizon decision-making across dynamic web interfaces. In this work, we
present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework
for training web agents. It learns directly from online interactions with web
environments by asynchronously generating diverse trajectories, guided entirely
by binary rewards based on task success. Experiments on the WebArena-Lite
benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task
success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to
44.8%, significantly outperforming existing state-of-the-art methods and strong
proprietary models such as OpenAI o3. In-depth analyses reveal the
effectiveness of the thinking-based prompting strategy and test-time scaling
through increased interactions for web tasks. We further investigate different
RL initialization policies by introducing two variants, namely WebAgent-R1-Zero
and WebAgent-R1-CoT, which highlight the importance of the warm-up training
stage (i.e., behavior cloning) and provide insights into incorporating long
chain-of-thought (CoT) reasoning in web agents.
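
To make the training setup concrete, the following is a minimal sketch of an asynchronous multi-turn rollout loop with a binary outcome reward, as the abstract describes. The `env` and `policy` objects and their methods (`reset`, `step`, `generate`) are hypothetical interfaces for illustration, not the authors' actual implementation.

```python
import asyncio

async def rollout(env, policy, task, max_turns=15):
    # Roll out one multi-turn trajectory: the agent observes the page,
    # emits an action, and repeats until the episode ends or the turn
    # budget is exhausted. Only the final task outcome is rewarded.
    obs = env.reset(task)              # initial observation (e.g., page HTML)
    trajectory, done, success = [], False, False
    for _ in range(max_turns):
        action = await policy.generate(obs, trajectory)  # LLM picks a web action
        obs, done, success = env.step(action)            # execute in the browser
        trajectory.append((action, obs))
        if done:
            break
    reward = 1.0 if success else 0.0   # binary reward: 1 for task success, else 0
    return trajectory, reward

async def collect_batch(envs, policy, tasks):
    # Asynchronously generate diverse trajectories across parallel web
    # environments; the (trajectory, reward) pairs feed the RL update.
    return await asyncio.gather(
        *(rollout(env, policy, task) for env, task in zip(envs, tasks))
    )
```

Because the reward is assigned only at the end of each trajectory, credit for intermediate actions comes from the multi-turn RL update rather than per-step supervision.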