Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

Abstract

Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances on offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse, large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories and spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making the recipe affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes large-scale, state-of-the-art LMM-based agent research more accessible.
