DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
June 13, 2025
Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
cs.AI
Abstract
Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By
autonomously orchestrating multi-step web exploration, targeted retrieval, and
higher-order synthesis, they transform vast amounts of online information into
analyst-grade, citation-rich reports, compressing hours of manual desk research
into minutes. However, a comprehensive benchmark for systematically evaluating
the capabilities of these agents remains absent. To bridge this gap, we present
DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks,
each meticulously crafted by domain experts across 22 distinct fields.
Evaluating DRAs is inherently complex and labor-intensive. We therefore propose
two novel methodologies that achieve strong alignment with human judgment. The
first is a reference-based method with adaptive criteria to assess the quality
of generated research reports. The second framework evaluates a DRA's
information retrieval and collection capabilities by assessing its effective
citation count and overall citation accuracy. We have open-sourced
DeepResearch Bench and key components of these frameworks at
https://github.com/Ayanami0730/deep_research_bench to accelerate the
development of practical LLM-based agents.
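To make the citation-based evaluation concrete, below is a minimal Python sketch of how an effective citation count and overall citation accuracy might be computed for a single report. The Citation schema, the citation_metrics function, and the binary "supported" judgment are illustrative assumptions, not the benchmark's actual implementation; see the linked repository for the real framework.

from dataclasses import dataclass

@dataclass
class Citation:
    """One citation extracted from a generated report (illustrative schema)."""
    url: str          # cited source
    claim: str        # statement in the report attributed to this source
    supported: bool   # whether the source actually backs the claim; in practice
                      # this judgment would come from a fact-checking step

def citation_metrics(citations: list[Citation]) -> dict[str, float]:
    """Toy computation of two citation-grounding metrics.

    effective_citations: number of citations whose claims are supported
    citation_accuracy:   supported citations / all citations
    """
    if not citations:
        return {"effective_citations": 0.0, "citation_accuracy": 0.0}
    supported = [c for c in citations if c.supported]
    return {
        "effective_citations": float(len(supported)),
        "citation_accuracy": len(supported) / len(citations),
    }

# Example: three citations, two of which survive fact-checking.
report_citations = [
    Citation("https://example.org/a", "claim 1", supported=True),
    Citation("https://example.org/b", "claim 2", supported=True),
    Citation("https://example.org/c", "claim 3", supported=False),
]
print(citation_metrics(report_citations))
# -> {'effective_citations': 2.0, 'citation_accuracy': 0.666...}

Under these assumptions, a report that cites many sources but is rarely backed by them scores high on raw citation volume yet low on both metrics, which is the behavior the abstract's retrieval-evaluation framework is designed to expose.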