WideSearch: Benchmarking Agentic Broad Info-Seeking
August 11, 2025
Authors: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
cs.AI
Abstract
From professional research to everyday planning, many tasks are bottlenecked
by wide-scale information seeking, which is more repetitive than cognitively
complex. With the rapid development of Large Language Models (LLMs), automated
search agents powered by LLMs offer a promising solution to liberate humans
from this tedious work. However, the capability of these agents to perform such
"wide-context" collection reliably and completely remains largely unevaluated
due to a lack of suitable benchmarks. To bridge this gap, we introduce
WideSearch, a new benchmark engineered to evaluate agent reliability on these
large-scale collection tasks. The benchmark features 200 manually curated
questions (100 in English, 100 in Chinese) from over 15 diverse domains,
grounded in real user queries. Each task requires agents to collect large-scale
atomic information, each piece of which can be objectively verified one by one,
and to arrange it into a well-organized output. A rigorous five-stage quality
control pipeline
ensures the difficulty, completeness, and verifiability of the dataset. We
benchmark over 10 state-of-the-art agentic search systems, spanning
single-agent frameworks, multi-agent frameworks, and end-to-end commercial
systems. Most systems achieve overall success rates near 0%, with the best
performer reaching just 5%. However, given sufficient time, cross-validation by
multiple human testers can achieve a near-100% success rate. These results
demonstrate that current search agents have critical deficiencies in
large-scale information seeking, underscoring urgent areas for future research
and development in agentic search. Our dataset, evaluation pipeline, and
benchmark results are publicly available at https://widesearch-seed.github.io/.
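The abstract describes two levels of evaluation: atomic items that can be verified one by one, and a strict overall success rate at the task level. Below is a minimal sketch of how such scoring could be implemented, assuming answers are arranged as row-keyed tables of atomic cells, that cells are compared by naive normalized string equality, and that a task succeeds only when every cell is correct. These assumptions (the `TaskResult` layout, the matching rule, and the all-or-nothing criterion) are illustrative and not taken from the paper's released pipeline.

```python
# Illustrative sketch only: the data layout, matching rule, and success
# criterion below are assumptions, not WideSearch's released evaluation code.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    # Gold and predicted answers as row-keyed tables: row key (e.g., an
    # entity name) -> {field name -> atomic value}, so every cell is one
    # independently checkable item.
    gold: dict[str, dict[str, str]]
    pred: dict[str, dict[str, str]]


def _items(table: dict[str, dict[str, str]]) -> set[tuple[str, str, str]]:
    """Flatten a table into (row, field, normalized value) atomic items."""
    return {(row, f, v.strip().lower())
            for row, fields in table.items() for f, v in fields.items()}


def item_f1(result: TaskResult) -> float:
    """Cell-level F1: each atomic item is verified one by one."""
    gold, pred = _items(result.gold), _items(result.pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def success_rate(results: list[TaskResult]) -> float:
    """Task-level success: a task counts only if its output is fully correct."""
    if not results:
        return 0.0
    return sum(item_f1(r) == 1.0 for r in results) / len(results)
```

Under this strict criterion a single wrong or missing cell zeroes out the whole task, which is one way overall success rates could sit near 0% even when agents retrieve many individual items correctly; a real verifier would also need per-field normalization (dates, numbers, entity aliases) rather than plain lowercase string comparison.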