

WideSearch: Benchmarking Agentic Broad Info-Seeking

August 11, 2025
作者: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
cs.AI

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, work that is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to free humans from this tedious work. However, the ability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) spanning over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be objectively verified one by one, and to organize it into a well-structured output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent frameworks, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. By contrast, given sufficient time, cross-validation by multiple human testers achieves a near-100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent directions for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/