
DREAM: Deep Research Evaluation with Agentic Metrics

February 21, 2026
作者: Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, Ron Litman
cs.AI

Abstract

Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
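The abstract describes an evaluation protocol that combines fixed, query-agnostic metrics with adaptive metrics produced by a tool-calling agent. The sketch below illustrates that two-family structure in Python; every name, metric, and weight here is an illustrative assumption (the paper's actual metrics and agent are not specified in this abstract), and the "agent" is stubbed as a simple keyword probe where a real system would browse and call tools.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a DREAM-style protocol: all metric names and
# scoring rules are assumptions, not the paper's implementation.

@dataclass
class Metric:
    name: str
    score: Callable[[str], float]  # maps a report to a score in [0, 1]

def query_agnostic_metrics() -> list[Metric]:
    """Fixed checks applied to every report, regardless of the query."""
    return [
        # Crude proxy for citation alignment: report contains bracketed refs.
        Metric("has_citations", lambda r: 1.0 if "[" in r and "]" in r else 0.0),
        # Crude proxy for depth: word count, capped at 50 words.
        Metric("non_trivial_length", lambda r: min(len(r.split()) / 50, 1.0)),
    ]

def adaptive_metrics(query: str) -> list[Metric]:
    """Stand-in for the tool-calling agent that derives query-specific probes.

    A real agent would use tools (search, fact-checking) to generate
    temporally aware, grounded checks; here we fake one coverage probe
    per salient query keyword.
    """
    keywords = [w for w in query.lower().split() if len(w) > 4]
    return [
        Metric(f"covers_{w}", lambda r, w=w: 1.0 if w in r.lower() else 0.0)
        for w in keywords
    ]

def evaluate(query: str, report: str) -> dict[str, float]:
    """Score a report under both metric families, reference-free."""
    metrics = query_agnostic_metrics() + adaptive_metrics(query)
    return {m.name: m.score(report) for m in metrics}
```

The point of the split is that the agent-generated metrics can change per query (and over time), which is what lets an agentic evaluator probe temporal validity and factual grounding that a static rubric cannot.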