OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.