MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
April 9, 2026
Authors: Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna
cs.AI
Abstract
Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress.
We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs.
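The screenshot-to-action policy loop described above can be sketched as follows. This is a minimal illustration, not the MolmoWeb API: the names (`predict_action`, `Action`, `run_episode`) and the action schema are hypothetical, and the policy is stubbed out where a real VLM inference call would go.

```python
# Hedged sketch of an instruction-conditioned visual-language action policy
# loop: the agent sees only a task instruction and pixels, and emits the next
# browser action. All names here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    kind: str                          # e.g. "click", "type", "stop"
    target: Optional[tuple] = None     # screen coordinates predicted from pixels
    text: str = ""

def predict_action(instruction: str, screenshot: bytes, history: list) -> Action:
    """Stand-in for the VLM policy: maps (instruction, screenshot, history)
    directly to the next action -- no HTML, accessibility tree, or APIs."""
    # A real model would run inference here; this stub terminates immediately.
    return Action(kind="stop")

def run_episode(instruction: str,
                take_screenshot: Callable[[], bytes],
                execute: Callable[[Action], None],
                max_steps: int = 30) -> list:
    """Observe-predict-act loop until the policy emits "stop" or steps run out."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = predict_action(instruction, take_screenshot(), history)
        history.append(action)
        if action.kind == "stop":
            break
        execute(action)  # apply the action in the browser
    return history

history = run_episode("find the cheapest flight",
                      take_screenshot=lambda: b"",
                      execute=lambda a: None)
```

In a real deployment, `take_screenshot` and `execute` would wrap a browser automation layer; the key property is that the policy's only observation is the rendered screenshot.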
Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similarly sized open-weight models such as Fara-7B, UI-TARS-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains from test-time scaling via parallel rollouts with best-of-N selection, reaching 94.7% and 60.5% pass@4 (versus 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
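The best-of-N test-time scaling strategy can be sketched as below. Both `rollout` and `score_rollout` are toy placeholders standing in for an agent episode and a trajectory selector; the abstract does not specify the selection mechanism, so treat the scoring function as an assumption.

```python
# Hedged sketch of test-time scaling: run N independent rollouts of the same
# task and keep the highest-scoring trajectory (best-of-N selection).
# `rollout` and `score_rollout` are illustrative stubs, not the paper's method.

import random

def rollout(task: str, seed: int) -> list[str]:
    """Stand-in for one independently sampled agent episode."""
    rng = random.Random(seed)  # distinct seeds yield distinct trajectories
    return [f"action_{rng.randint(0, 9)}" for _ in range(3)]

def score_rollout(trajectory: list[str]) -> int:
    """Placeholder verifier/judge that scores a finished trajectory."""
    return sum(int(a.split("_")[1]) for a in trajectory)

def best_of_n(task: str, n: int = 4) -> list[str]:
    # The N rollouts are independent, so in practice they run in parallel;
    # here they run sequentially for simplicity.
    trajectories = [rollout(task, seed) for seed in range(n)]
    return max(trajectories, key=score_rollout)

best = best_of_n("buy a phone case", n=4)
```

Because the rollouts are independent, this adds only latency-parallel compute at inference time, which matches the pass@1 versus pass@4 comparison reported above.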