How Far Are We from Genuinely Useful Deep Research Agents?

December 1, 2025
Authors: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
cs.AI

Abstract

Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs have been validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from both task complexity and subjective metrics, so they fail to reflect user demands and limit the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotation and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
December 3, 2025