Understanding DeepResearch via Reports
October 9, 2025
Authors: Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang
cs.AI
Abstract
DeepResearch agents represent a transformative AI paradigm, conducting
expert-level research through sophisticated reasoning and multi-tool
integration. However, evaluating these systems remains critically challenging
due to open-ended research scenarios and existing benchmarks that focus on
isolated capabilities rather than holistic performance. Unlike traditional LLM
tasks, DeepResearch systems must synthesize diverse sources, generate insights,
and present coherent findings, capabilities that resist simple
verification. To address this gap, we introduce DeepResearch-ReportEval, a
comprehensive framework designed to assess DeepResearch systems through their
most representative outputs: research reports. Our approach systematically
measures three dimensions: quality, redundancy, and factuality, using an
LLM-as-a-Judge methodology that achieves strong concordance with expert judgments. We
contribute a standardized benchmark of 100 curated queries spanning 12
real-world categories, enabling systematic capability comparison. Our
evaluation of four leading commercial systems reveals distinct design
philosophies and performance trade-offs, establishing foundational insights as
DeepResearch evolves from information assistants toward intelligent research
partners. Source code and data are available at:
https://github.com/HKUDS/DeepResearch-Eval.
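
To make the LLM-as-a-Judge setup concrete, below is a minimal Python sketch of a judge that scores a generated report on the three dimensions named in the abstract (quality, redundancy, factuality). The prompt wording, the 1-5 scale, and the names `call_judge_llm`, `judge_report`, and `ReportScores` are illustrative assumptions, not the released DeepResearch-ReportEval implementation; see the linked repository for the actual code and rubrics.

```python
"""Minimal LLM-as-a-Judge sketch for scoring research reports.

Hypothetical example: the rubric, prompt, and judge backend are
placeholders, not the DeepResearch-ReportEval implementation.
"""
import json
from dataclasses import dataclass

DIMENSIONS = ("quality", "redundancy", "factuality")

JUDGE_PROMPT = """You are an expert research reviewer.
Score the report below on a 1-5 scale for each dimension:
- quality: depth, organization, and insightfulness
- redundancy: 5 = no repeated content, 1 = highly repetitive
- factuality: claims are accurate and well supported
Return JSON like {{"quality": 4, "redundancy": 5, "factuality": 3}}.

Query: {query}

Report:
{report}
"""


@dataclass
class ReportScores:
    quality: float
    redundancy: float
    factuality: float


def call_judge_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call.

    Returns a canned JSON reply so the sketch runs end to end.
    """
    return '{"quality": 4, "redundancy": 5, "factuality": 4}'


def judge_report(query: str, report: str) -> ReportScores:
    """Ask the judge model for per-dimension scores and parse its JSON reply."""
    raw = call_judge_llm(JUDGE_PROMPT.format(query=query, report=report))
    scores = json.loads(raw)
    return ReportScores(**{d: float(scores[d]) for d in DIMENSIONS})


if __name__ == "__main__":
    demo = judge_report(
        query="How have small modular reactors progressed since 2020?",
        report="(full research report text here)",
    )
    print(demo)
```

In practice, one would replace `call_judge_llm` with a call to the chosen judge model and average scores over the 100 benchmark queries per system.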