DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents
January 28, 2026
Authors: Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, Dipanjan Das
cs.AI
Abstract
We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
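The recall/precision tension described above can be made concrete with a small sketch. The abstract does not specify DeepSearchQA's exact scoring rule, so the following is a minimal illustration under the assumption of set-based precision and recall over a de-duplicated answer list, with naive string normalization standing in for entity resolution; the names `normalize` and `score` are hypothetical.

```python
def normalize(answer: str) -> str:
    # Naive entity resolution: case-fold and collapse whitespace so that
    # "Marie Curie" and " marie  curie " resolve to the same entity.
    return " ".join(answer.lower().split())

def score(predicted: list[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over de-duplicated answers."""
    pred = {normalize(a) for a in predicted}       # de-duplication step
    gold_norm = {normalize(a) for a in gold}
    tp = len(pred & gold_norm)                     # correct unique answers
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold_norm) if gold_norm else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"alpha", "beta", "gamma"}

# Premature stopping (under-retrieval): perfect precision, poor recall.
p1, r1, _ = score(["alpha"], gold)                 # precision 1.0, recall 1/3

# Hedging: a wide net of low-confidence guesses (with a duplicate that
# entity resolution collapses) inflates recall but halves precision.
p2, r2, _ = score(
    ["alpha", "ALPHA", "beta", "gamma", "delta", "epsilon", "zeta"], gold
)                                                  # precision 0.5, recall 1.0
```

Under such a metric, both failure modes the evaluation reports are visible: stopping early caps recall, while indiscriminate hedging trades precision for recall, which is why a balanced measure such as F1 penalizes both.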