OpenWebRL: Demystifying Online Multi-Turn Reinforcement Learning for Visual Web Agents

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online reinforcement learning (RL) has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.