OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.