A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

Abstract

Artificial intelligence is undergoing a paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment of this shift, Deep Research Agents (DRAs) systematically exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex, open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, which limits their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by a manually constructed reference bundle to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experiments confirm that mainstream DRAs outperform reasoning models augmented with web-search tools, yet reveal considerable room for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
