WideSearch: Benchmarking Agentic Broad Info-Seeking

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, work that is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to free humans from this tedious work. However, the ability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated because suitable benchmarks are lacking. To bridge this gap, we introduce WideSearch, a new benchmark designed to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect a large number of atomic pieces of information, each of which can be objectively verified one by one, and to arrange them into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent frameworks, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. By contrast, given sufficient time, multiple human testers cross-validating one another's work can achieve a success rate near 100%. These results demonstrate that current search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
