
How Far Are We from Genuinely Useful Deep Research Agents?

December 1, 2025
Authors: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
cs.AI

Abstract

Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs have been validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from high task complexity and subjective evaluation metrics, which fail to reflect real user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose the Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.