OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online reinforcement learning (RL) has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.