WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

May 22, 2025
Authors: Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
cs.AI

Abstract

While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
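
The rollout scheme the abstract describes (multi-turn interaction, trajectories generated asynchronously, a single binary reward for task success) can be illustrated with a toy sketch. This is not the authors' implementation: `ToyWebEnv`, `random_policy`, `scripted_policy`, and every other name below are hypothetical stand-ins, the environment is a two-action form rather than a real website, and the policy update step is omitted entirely.

```python
import asyncio
import random
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)  # (observation, action) pairs
    reward: float = 0.0                        # binary outcome reward


class ToyWebEnv:
    """Hypothetical stand-in for a web environment.

    The task succeeds only if the agent submits after filling the form.
    """

    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.reset()

    def reset(self):
        self.filled = False
        self.done = False
        self.success = False
        self.turn = 0
        return "page:form"

    async def step(self, action):
        await asyncio.sleep(0)  # yield control, standing in for page latency
        self.turn += 1
        if action == "fill":
            self.filled = True
        elif action == "submit":
            self.done = True
            self.success = self.filled
        if self.turn >= self.max_turns:
            self.done = True  # turn budget exhausted
        obs = "page:filled" if self.filled else "page:form"
        return obs, self.done


def random_policy(obs, rng):
    # Placeholder for an LLM policy: picks an action uniformly at random.
    return rng.choice(["fill", "submit", "scroll"])


def scripted_policy(obs, rng):
    # Always solves the toy task: fill first, then submit.
    return "submit" if obs == "page:filled" else "fill"


async def rollout(env, policy, rng):
    """Run one multi-turn episode and attach a 0/1 reward at the end."""
    traj = Trajectory()
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs, rng)
        traj.steps.append((obs, action))
        obs, done = await env.step(action)
    traj.reward = 1.0 if env.success else 0.0
    return traj


async def collect(num_rollouts, policy=random_policy, seed=0):
    """Generate several trajectories concurrently, one environment each."""
    rng = random.Random(seed)
    tasks = [rollout(ToyWebEnv(), policy, rng) for _ in range(num_rollouts)]
    return await asyncio.gather(*tasks)


def success_rate(trajs):
    return sum(t.reward for t in trajs) / len(trajs)


if __name__ == "__main__":
    trajs = asyncio.run(collect(8))
    print(f"success rate: {success_rate(trajs):.2f}")
```

In an actual RL setup, the collected trajectories and their binary rewards would feed a policy-gradient update; the sketch stops at data collection because the abstract does not specify the optimization details.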
