

DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

April 16, 2026
作者: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu
cs.AI

Abstract
Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^{3}-Agent based on multiple state-of-the-art language models demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
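The five evaluation dimensions could be gathered into a simple per-report score record. The sketch below is purely illustrative: the class name, field names, score range, and unweighted-mean aggregation are assumptions for exposition, not the paper's actual scoring rubric or weighting.

```python
from dataclasses import dataclass, fields


@dataclass
class DR3Score:
    """Hypothetical container for the five DR^3-Eval dimensions.

    Each dimension is assumed to be normalized to [0, 1]; the paper
    does not specify its normalization or aggregation scheme.
    """
    information_recall: float
    factual_accuracy: float
    citation_coverage: float
    instruction_following: float
    depth_quality: float

    def overall(self) -> float:
        # Unweighted mean over the five dimensions (an assumption).
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)


# Example: a report strong on instruction following but weak on depth.
score = DR3Score(
    information_recall=0.72,
    factual_accuracy=0.85,
    citation_coverage=0.60,
    instruction_following=0.90,
    depth_quality=0.55,
)
print(round(score.overall(), 3))  # → 0.724
```

A per-dimension breakdown like this makes the reported failure modes legible: low `citation_coverage` with high fluency, for instance, is one signature of the hallucination-control weaknesses the benchmark is designed to expose.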