
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

February 17, 2025
作者: Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah
cs.AI

Abstract

Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
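
To make the synthesis recipe described above more concrete, the following is a minimal sketch of what an exploration-driven trajectory synthesis loop could look like: a model drafts a task intent for a seed URL, an agent explores the page while screenshots and actions are logged, the intent is refined to match the steps actually executed, and only trajectories judged successful are kept. This is an illustrative sketch, not the authors' implementation; all names here (propose_intent, act, refine_intent, judge_success, Step, Trajectory) are assumptions for the example.

```python
# Minimal sketch of an exploration-driven web trajectory synthesis loop.
# The LMM-backed components are passed in as callables; their names and
# signatures are illustrative assumptions, not the paper's actual interface.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    screenshot: bytes   # page screenshot captured at this step
    action: str         # e.g. "click(#submit)" or "type(#query, 'flights to NYC')"


@dataclass
class Trajectory:
    url: str                                  # seed URL the exploration started from
    intent: str                               # refined task intent describing the trajectory
    steps: List[Step] = field(default_factory=list)


def synthesize_trajectory(
    seed_url: str,
    propose_intent: Callable[[str], str],              # LMM: seed URL -> draft task intent
    act: Callable[[str, str], Optional[Step]],         # agent: (url, intent) -> next step, or None when done
    refine_intent: Callable[[str, List[Step]], str],   # LMM: rewrite intent to match executed steps
    judge_success: Callable[[str, List[Step]], bool],  # LMM/heuristic: does the trajectory satisfy the intent?
    max_steps: int = 15,
) -> Optional[Trajectory]:
    """Return one successful trajectory for seed_url, or None if exploration failed."""
    draft = propose_intent(seed_url)
    steps: List[Step] = []
    for _ in range(max_steps):
        step = act(seed_url, draft)
        if step is None:          # agent signals it considers the task complete
            break
        steps.append(step)
    refined = refine_intent(draft, steps)   # align the intent with what actually happened
    if steps and judge_success(refined, steps):
        return Trajectory(url=seed_url, intent=refined, steps=steps)
    return None                             # unsuccessful explorations are discarded
```

The key design choice this sketch tries to capture is post-hoc intent refinement: rather than committing to a fixed instruction up front, the intent is rewritten to describe what the agent actually accomplished, which is one plausible way the recipe could yield diverse, grounded task intents at low cost per successful trajectory.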
