Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Abstract

Agentic search, exemplified by Deep Research systems in which large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While it promises greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating complex, time-varying answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and of human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, already achieves 50-70% of human performance while spending half the time, showing great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
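
To make the idea of a tree-structured rubric concrete, the following is a minimal sketch and not the paper's implementation. It assumes that leaf nodes hold binary checks and that an internal node's score is the unweighted mean of its children; the abstract does not specify the aggregation rule, so this choice is purely illustrative. The names `RubricNode`, `score`, and the example checks are hypothetical, and the simple string tests stand in for what would be judge-agent calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RubricNode:
    """One node of a hypothetical rubric tree."""
    name: str
    # Leaf nodes carry a binary check; here a plain function stands in
    # for a judge-agent call (an assumption for illustration).
    check: Optional[Callable[[str], bool]] = None
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode, answer: str) -> float:
    """Score an answer against the rubric, returning a value in [0, 1].

    Assumed aggregation: a leaf scores 1.0 or 0.0, and an internal node
    scores the unweighted mean of its children.
    """
    if not node.children:
        if node.check is None:
            raise ValueError(f"leaf node {node.name!r} has no check")
        return 1.0 if node.check(answer) else 0.0
    return sum(score(child, answer) for child in node.children) / len(node.children)


# A toy rubric with one correctness subtree and one attribution leaf.
rubric = RubricNode("root", children=[
    RubricNode("correctness", children=[
        RubricNode("mentions_price", check=lambda a: "$" in a),
        RubricNode("mentions_city", check=lambda a: "Columbus" in a),
    ]),
    RubricNode("attribution", check=lambda a: "http" in a),
])

example_answer = "The ticket costs $20 (source: https://example.com)."
result = score(rubric, example_answer)
print(result)
```

In this toy case the answer satisfies one of the two correctness checks and the attribution check, so the tree yields partial credit rather than a single pass/fail verdict, which is the property that makes a tree-shaped rubric suited to complex answers.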