A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports
October 2, 2025
Authors: Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang
cs.AI
Abstract
Artificial intelligence is undergoing a paradigm shift from closed language
models to interconnected agent systems capable of external perception and
information integration. As a representative example, Deep Research Agents
(DRAs) systematically exhibit capabilities for task decomposition,
cross-source retrieval, multi-stage reasoning, and structured output, which
markedly enhance performance on complex and open-ended tasks. However, existing
benchmarks remain deficient in evaluation dimensions, response formatting, and
scoring mechanisms, limiting their capacity to assess such systems effectively.
This paper introduces a rigorous benchmark and a multidimensional evaluation
framework tailored to DRAs and report-style responses. The benchmark comprises
214 expert-curated challenging queries distributed across 10 broad thematic
domains, each accompanied by manually constructed reference bundles to support
composite evaluation. The framework enables comprehensive evaluation of
long-form reports generated by DRAs, incorporating integrated scoring metrics
for semantic quality, topical focus, and retrieval trustworthiness. Extensive
experiments confirm that mainstream DRAs outperform reasoning models augmented
with web-search tools, yet reveal considerable scope for
further improvement. This study provides a robust foundation for capability
assessment, architectural refinement, and paradigm advancement in DRA systems.