WideSearch: Benchmarking Agentic Broad Info-Seeking
August 11, 2025
Authors: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
cs.AI
Abstract
From professional research to everyday planning, many tasks are bottlenecked
by wide-scale information seeking, which is more repetitive than cognitively
complex. With the rapid development of Large Language Models (LLMs), automated
search agents powered by LLMs offer a promising solution to liberate humans
from this tedious work. However, the capability of these agents to perform such
"wide-context" collection reliably and completely remains largely unevaluated
due to a lack of suitable benchmarks. To bridge this gap, we introduce
WideSearch, a new benchmark engineered to evaluate agent reliability on these
large-scale collection tasks. The benchmark features 200 manually curated
questions (100 in English, 100 in Chinese) from over 15 diverse domains,
grounded in real user queries. Each task requires agents to collect large-scale
atomic information, each item of which can be objectively verified, and to
arrange it into a well-organized, structured output. A rigorous five-stage
quality control pipeline
ensures the difficulty, completeness, and verifiability of the dataset. We
benchmark over 10 state-of-the-art agentic search systems, including
single-agent frameworks, multi-agent frameworks, and end-to-end commercial
systems. Most systems achieve overall success rates near 0%, with the best
performer reaching just 5%. However, given sufficient time, cross-validation
by multiple human testers can achieve a success rate near 100%. These results
demonstrate that current search agents have critical deficiencies in
large-scale information seeking, underscoring urgent areas for future research
and development in agentic search. Our dataset, evaluation pipeline, and
benchmark results have been publicly released at https://widesearch-seed.github.io/.
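The abstract describes tasks whose answers are large sets of atomic items that can be verified one by one, with a strict task-level success rate. The actual scoring code is not given here; the Python below is a minimal sketch, assuming gold and predicted answers are tables keyed by entity, with hypothetical helpers `normalize` and `score_task`, and counting a task as successful only when every atomic cell matches. It is an illustration of the evaluation idea, not the benchmark's implementation.

```python
# Hypothetical sketch of per-item verification for a WideSearch-style task.
# Assumptions (not from the abstract): gold answers and agent output are both
# tables keyed by an entity column, and a task "succeeds" only if every
# atomic cell matches its gold value after light normalization.

def normalize(value: str) -> str:
    """Canonicalize an atomic cell so objective comparison is string equality."""
    return " ".join(str(value).strip().lower().split())

def score_task(gold: dict[str, dict[str, str]],
               pred: dict[str, dict[str, str]]) -> tuple[float, bool]:
    """Return (fraction of gold cells matched, strict all-cells-correct success)."""
    total = correct = 0
    for entity, gold_row in gold.items():
        pred_row = pred.get(entity, {})
        for field, gold_val in gold_row.items():
            total += 1
            if normalize(pred_row.get(field, "")) == normalize(gold_val):
                correct += 1
    item_acc = correct / total if total else 0.0
    return item_acc, correct == total  # strict success: every atomic item correct

# Toy usage: one missing cell fails the whole task, which is one way strict
# success rates can sit near 0% even when most cells are collected correctly.
gold = {"Paper A": {"venue": "NeurIPS", "year": "2024"},
        "Paper B": {"venue": "ICML", "year": "2023"}}
pred = {"Paper A": {"venue": "NeurIPS", "year": "2024"},
        "Paper B": {"venue": "ICML"}}  # "year" never collected
acc, success = score_task(gold, pred)
print(f"item accuracy={acc:.2f}, task success={success}")  # 0.75, False
```

Under this all-or-nothing reading, an agent that retrieves most but not all items still fails the task, which would explain the wide gap between near-0% agent success and the near-100% rate achieved by patient human cross-validation.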