WideSearch: Benchmarking Agentic Broad Info-Seeking

Abstract

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising way to free humans from this tedious work. However, the ability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated for lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark designed to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, each item of which can be objectively verified, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent and multi-agent frameworks as well as end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a success rate near 100%. These results demonstrate that current search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/
