

DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

April 16, 2026
作者: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu
cs.AI

Abstract
Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^{3}-Agent based on multiple state-of-the-art language models demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
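The five evaluation dimensions could be gathered into a simple per-report score record. The sketch below is purely illustrative: the class name, field names, score range, and unweighted-mean aggregation are assumptions for exposition, not the paper's actual scoring rubric or weighting.

```python
from dataclasses import dataclass, fields


@dataclass
class DR3Score:
    """Hypothetical container for the five DR^3-Eval dimensions.

    Each dimension is assumed to be normalized to [0, 1]; the paper
    does not specify its normalization or aggregation scheme.
    """
    information_recall: float
    factual_accuracy: float
    citation_coverage: float
    instruction_following: float
    depth_quality: float

    def overall(self) -> float:
        # Unweighted mean over the five dimensions (an assumption).
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)


# Example: a report strong on instruction following but weak on depth.
score = DR3Score(
    information_recall=0.72,
    factual_accuracy=0.85,
    citation_coverage=0.60,
    instruction_following=0.90,
    depth_quality=0.55,
)
print(round(score.overall(), 3))  # → 0.724
```

A per-dimension breakdown like this makes the reported failure modes legible: low `citation_coverage` with high fluency, for instance, is one signature of the hallucination-control weaknesses the benchmark is designed to expose.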