MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Abstract

Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with over 30K human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B parameter sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming open-weight-only models of similar scale such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web, respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
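The test-time scaling described in the abstract (parallel rollouts with best-of-N selection, reported as pass@k) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` type, its `score` field (standing in for some judge or verifier score), and the reading of pass@k as "at least one of the first k rollouts succeeded" are all assumptions made for illustration, and the abstract does not specify how selection or scoring is done.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One attempt at a task. All fields are hypothetical, for illustration."""
    actions: List[str]  # browser actions taken during this attempt
    score: float        # assumed judge/verifier score, higher is better
    success: bool       # ground-truth outcome, used only for evaluation


def best_of_n(rollouts: List[Rollout]) -> Rollout:
    """Pick the rollout with the highest score (first one wins on ties)."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return max(rollouts, key=lambda r: r.score)


def pass_at_k(results: List[List[bool]], k: int) -> float:
    """Fraction of tasks where any of the first k rollouts succeeded.

    `results[i][j]` is the success flag of rollout j on task i.
    This is one common reading of pass@k, assumed here.
    """
    if not results:
        raise ValueError("need at least one task")
    solved = sum(1 for task in results if any(task[:k]))
    return solved / len(results)


# Toy data: three tasks, four rollouts each (values are made up).
toy_results = [
    [False, True, False, False],
    [False, False, False, False],
    [True, False, False, False],
]
toy_rollouts = [
    Rollout(actions=["click"], score=0.2, success=False),
    Rollout(actions=["type", "click"], score=0.9, success=True),
]
chosen = best_of_n(toy_rollouts)
```

On the toy data, `pass_at_k(toy_results, 1)` counts only the first rollout per task, while `pass_at_k(toy_results, 4)` credits a task if any of its four rollouts succeeded, which is why pass@4 can only match or exceed pass@1.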
