Mind2Web 2:以智能体为评判者的代理搜索评估
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
June 26, 2025
作者: Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
cs.AI
摘要
诸如深度研究系统等自主搜索技术,其中大型语言模型能够自主浏览网页、整合信息并返回带有全面引用的答案,代表了用户与网络规模信息交互方式的重大转变。尽管这类技术承诺带来更高的效率和认知负荷减轻,但其日益增长的复杂性和开放性已超越了现有的评估基准和方法论,这些基准和方法论大多假设搜索范围较短且答案静态不变。本文中,我们推出了Mind2Web 2,这是一个包含130项现实、高质量且长期任务的数据集,这些任务要求实时网页浏览和广泛的信息整合,构建过程耗费了超过1000小时的人力。为了应对评估随时间变化且复杂答案的挑战,我们提出了一种新颖的“代理即裁判”框架。我们的方法基于树形结构评分标准设计,构建特定任务的裁判代理,以自动评估答案的正确性和来源归属。我们对九种前沿自主搜索系统及人类表现进行了全面评估,并进行了详细的错误分析,为未来发展提供洞见。表现最佳的系统——OpenAI深度研究,已能在花费一半时间的情况下达到人类表现的50-70%,展现出巨大潜力。总之,Mind2Web 2为开发和基准测试下一代自主搜索系统奠定了坚实基础。
English
Agentic search such as Deep Research systems, where large language models
autonomously browse the web, synthesize information, and return comprehensive
citation-backed answers, represents a major shift in how users interact with
web-scale information. While promising greater efficiency and cognitive
offloading, the growing complexity and open-endedness of agentic search have
outpaced existing evaluation benchmarks and methodologies, which largely assume
short search horizons and static answers. In this paper, we introduce Mind2Web
2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that
require real-time web browsing and extensive information synthesis, constructed
with over 1,000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of nine frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, showing a great potential. Altogether, Mind2Web 2
provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.