
Understanding DeepResearch via Reports

October 9, 2025
Authors: Tianyu Fan, Xinyao Niu, Yuxiang Zheng, Fengji Zhang, Chengen Huang, Bei Chen, Junyang Lin, Chao Huang
cs.AI

Abstract

DeepResearch agents represent a transformative AI paradigm, conducting expert-level research through sophisticated reasoning and multi-tool integration. However, evaluating these systems remains critically challenging due to open-ended research scenarios and existing benchmarks that focus on isolated capabilities rather than holistic performance. Unlike traditional LLM tasks, DeepResearch systems must synthesize diverse sources, generate insights, and present coherent findings, capabilities that resist simple verification. To address this gap, we introduce DeepResearch-ReportEval, a comprehensive framework designed to assess DeepResearch systems through their most representative outputs: research reports. Our approach systematically measures three dimensions (quality, redundancy, and factuality) using an LLM-as-a-Judge methodology that achieves strong concordance with expert judgments. We contribute a standardized benchmark of 100 curated queries spanning 12 real-world categories, enabling systematic capability comparison. Our evaluation of four leading commercial systems reveals distinct design philosophies and performance trade-offs, establishing foundational insights as DeepResearch evolves from information assistants toward intelligent research partners. Source code and data are available at: https://github.com/HKUDS/DeepResearch-Eval.
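As a rough illustration of how such an LLM-as-a-Judge evaluation can be wired up, the minimal Python sketch below scores a single (query, report) pair on the three dimensions named in the abstract. The rubric wording, 1-10 score scale, model name, and `judge_report` helper are all hypothetical assumptions for illustration; the paper's actual prompts and evaluation protocol are in the linked repository.

```python
# Minimal LLM-as-a-Judge sketch: score one research report on the three
# dimensions from the abstract (quality, redundancy, factuality).
# Rubric text, scale, and model choice are illustrative assumptions,
# not the paper's actual evaluation prompts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are an expert research evaluator. Score the report below
on a 1-10 scale for each dimension:
- quality: depth of insight, coherence, and coverage of the query
- redundancy: 10 = no repeated content, 1 = heavily repetitive
- factuality: 10 = all checkable claims are accurate and sourced
Return JSON: {"quality": int, "redundancy": int, "factuality": int}"""

def judge_report(query: str, report: str, model: str = "gpt-4o") -> dict:
    """Ask an LLM judge to score one (query, report) pair."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Query:\n{query}\n\nReport:\n{report}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    scores = judge_report(
        query="How do retrieval-augmented agents handle conflicting sources?",
        report="...full report text...",
    )
    print(scores)  # e.g. {"quality": 8, "redundancy": 9, "factuality": 7}
```

In practice a framework like this would average scores over the benchmark's 100 queries per system and validate the judge against human expert ratings, which is the concordance the abstract reports.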