

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

April 9, 2026
作者: Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna
cs.AI

Abstract

Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring-expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similarly sized open-weight models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models such as GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
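The screenshot-in, action-out policy loop and the best-of-N test-time scaling described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `policy`, `rollout`, and `best_of_n` names, the action vocabulary, and the length-based judge are all hypothetical stand-ins (a real agent would call the MolmoWeb model and verify task success with a learned or rule-based judge).

```python
import random
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    target: str


def policy(instruction: str, screenshot: str, rng: random.Random) -> Action:
    # Stub for the instruction-conditioned visual-language action policy:
    # in the paper this is a model call mapping (instruction, screenshot)
    # to the next browser action. The action set below is hypothetical.
    candidates = [
        Action("click", "#search-box"),
        Action("type", "running shoes"),
        Action("scroll", "down"),
        Action("stop", "answer found"),
    ]
    return rng.choice(candidates)


def rollout(instruction: str, rng: random.Random, max_steps: int = 8) -> list:
    # One trajectory: predict an action from the current screenshot, act,
    # re-capture, and repeat until a terminal "stop" action or the budget.
    trajectory = []
    screenshot = "screenshot_0"  # placeholder for a captured page image
    for step in range(max_steps):
        action = policy(instruction, screenshot, rng)
        trajectory.append(action)
        if action.name == "stop":
            break
        screenshot = f"screenshot_{step + 1}"  # would re-render after acting
    return trajectory


def best_of_n(instruction: str, n: int = 4, seed: int = 0) -> list:
    # Test-time scaling via parallel rollouts: sample n independent
    # trajectories and keep the one a judge scores highest. Trajectory
    # length is a trivial placeholder heuristic, not the paper's judge.
    rollouts = [rollout(instruction, random.Random(seed + i)) for i in range(n)]
    return max(rollouts, key=len)


best = best_of_n("find the cheapest running shoes", n=4)
print([a.name for a in best])
```

The pass@4 gains reported above come from exactly this structure: the rollouts are independent, so a weak selector over four samples can recover tasks that a single greedy trajectory fails.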