MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Abstract

Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similarly sized open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains from test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
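The screenshot-only control loop and best-of-N selection described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: `Action`, `Browser`, `rollout`, `best_of_n`, the `"stop"` action kind, and the `judge` scoring function are assumptions, not part of any released MolmoWeb interface. The sketch also collects rollouts sequentially for simplicity, whereas the abstract describes them as parallel.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass(frozen=True)
class Action:
    """A single browser action (hypothetical schema)."""
    kind: str            # e.g. "click", "type", "scroll", "stop"
    argument: str = ""   # e.g. text to type, or a final answer


class Browser(Protocol):
    """Minimal environment interface assumed for this sketch."""
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


# The policy sees only the instruction and the current screenshot:
# no HTML, accessibility tree, or specialized API.
Policy = Callable[[str, bytes], Action]


def rollout(policy: Policy, browser: Browser, instruction: str,
            max_steps: int = 30) -> List[Action]:
    """Run one episode: observe a screenshot, predict an action, execute it."""
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(instruction, browser.screenshot())
        trajectory.append(action)
        if action.kind == "stop":   # the policy decides the task is finished
            break
        browser.execute(action)
    return trajectory


def best_of_n(rollouts: List[List[Action]],
              judge: Callable[[List[Action]], float]) -> List[Action]:
    """Return the rollout the judge scores highest (first one wins ties)."""
    return max(rollouts, key=judge)
```

In use, a caller would collect N trajectories for the same instruction, each from a fresh browser session, and pass them to `best_of_n` together with a scoring function. How the judge is implemented is left open here, since the abstract does not specify it.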
