DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
June 13, 2025
Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
cs.AI
Abstract
Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By
autonomously orchestrating multi-step web exploration, targeted retrieval, and
higher-order synthesis, they transform vast amounts of online information into
analyst-grade, citation-rich reports, compressing hours of manual desk research
into minutes. However, a comprehensive benchmark for systematically evaluating
the capabilities of these agents remains absent. To bridge this gap, we present
DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks,
each meticulously crafted by domain experts across 22 distinct fields.
Evaluating DRAs is inherently complex and labor-intensive. We therefore propose
two novel methodologies that achieve strong alignment with human judgment. The
first is a reference-based method with adaptive criteria to assess the quality
of generated research reports. The second framework evaluates a DRA's
information retrieval and collection capabilities by assessing its effective
citation count and overall citation accuracy. We have open-sourced
DeepResearch Bench and key components of these frameworks at
https://github.com/Ayanami0730/deep_research_bench to accelerate the
development of practical LLM-based agents.
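To make the citation-based evaluation concrete, below is a minimal Python sketch of how an effective citation count and overall citation accuracy might be computed for a single report. The Citation schema, the citation_metrics function, and the binary "supported" judgment are illustrative assumptions, not the benchmark's actual implementation; see the linked repository for the real framework.

from dataclasses import dataclass

@dataclass
class Citation:
    """One citation extracted from a generated report (illustrative schema)."""
    url: str          # cited source
    claim: str        # statement in the report attributed to this source
    supported: bool   # whether the source actually backs the claim; in practice
                      # this judgment would come from a fact-checking step

def citation_metrics(citations: list[Citation]) -> dict[str, float]:
    """Toy computation of two citation-grounding metrics.

    effective_citations: number of citations whose claims are supported
    citation_accuracy:   supported citations / all citations
    """
    if not citations:
        return {"effective_citations": 0.0, "citation_accuracy": 0.0}
    supported = [c for c in citations if c.supported]
    return {
        "effective_citations": float(len(supported)),
        "citation_accuracy": len(supported) / len(citations),
    }

# Example: three citations, two of which survive fact-checking.
report_citations = [
    Citation("https://example.org/a", "claim 1", supported=True),
    Citation("https://example.org/b", "claim 2", supported=True),
    Citation("https://example.org/c", "claim 3", supported=False),
]
print(citation_metrics(report_citations))
# -> {'effective_citations': 2.0, 'citation_accuracy': 0.666...}

Under these assumptions, a report that cites many sources but is rarely backed by them scores high on raw citation volume yet low on both metrics, which is the behavior the abstract's retrieval-evaluation framework is designed to expose.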