Mind2Web 2:以代理即法官的方式評估代理搜索
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
June 26, 2025
作者: Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
cs.AI
摘要
諸如深度研究系統的代理式搜索,其中大型語言模型自主瀏覽網絡、綜合信息並返回有引用支持的全面答案,代表了用戶與網絡規模信息互動方式的重大轉變。雖然這種方式承諾更高的效率和認知卸載,但代理式搜索日益增長的複雜性和開放性已超越了現有的評估基準和方法論,這些基準和方法論大多假設了較短的搜索視野和靜態答案。在本文中,我們介紹了Mind2Web 2,這是一個包含130個現實、高質量且長期視野任務的基準,這些任務需要實時網絡瀏覽和廣泛的信息綜合,並通過超過1000小時的人力勞動構建。為了解決評估時變和複雜答案的挑戰,我們提出了一種新穎的「代理即法官」框架。我們的方法基於樹狀結構的評分設計構建特定任務的法官代理,以自動評估答案的正確性和來源歸屬。我們對九個前沿代理式搜索系統和人類表現進行了全面評估,並進行了詳細的錯誤分析,以汲取未來發展的見解。表現最佳的系統,OpenAI深度研究,已經能夠在花費一半時間的情況下達到人類表現的50-70%,顯示出巨大的潛力。總的來說,Mind2Web 2為開發和基準測試下一代代理式搜索系統提供了嚴謹的基礎。
English
Agentic search such as Deep Research systems, where large language models
autonomously browse the web, synthesize information, and return comprehensive
citation-backed answers, represents a major shift in how users interact with
web-scale information. While promising greater efficiency and cognitive
offloading, the growing complexity and open-endedness of agentic search have
outpaced existing evaluation benchmarks and methodologies, which largely assume
short search horizons and static answers. In this paper, we introduce Mind2Web
2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that
require real-time web browsing and extensive information synthesis, constructed
with over 1,000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of nine frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, showing a great potential. Altogether, Mind2Web 2
provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.