

DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

April 16, 2026
作者: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu
cs.AI

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials, and each task is paired with a static research sandbox corpus, containing supportive documents, distractors, and noise, that simulates open-web complexity while remaining fully verifiable. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with DR^{3}-Agent, a multi-agent system we develop on top of multiple state-of-the-art language models, demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
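The abstract names five evaluation dimensions but does not specify how per-dimension scores are combined. The sketch below is purely illustrative: the field names follow the abstract, while the 0-1 score scale, the unweighted-mean aggregation, and the `DimensionScores` container are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass, astuple

@dataclass
class DimensionScores:
    """Scores for one generated report. Scales and aggregation
    are assumptions for illustration, not DR^{3}-Eval's actual
    scoring rules."""
    information_recall: float
    factual_accuracy: float
    citation_coverage: float
    instruction_following: float
    depth_quality: float

    def overall(self) -> float:
        # Unweighted mean over the five dimensions (an assumption;
        # the benchmark may weight or report dimensions separately).
        vals = astuple(self)
        return sum(vals) / len(vals)

scores = DimensionScores(
    information_recall=0.8,
    factual_accuracy=0.9,
    citation_coverage=0.7,
    instruction_following=1.0,
    depth_quality=0.6,
)
print(round(scores.overall(), 2))  # 0.8
```

Keeping the dimensions as separate fields, rather than collapsing them early, matches the abstract's emphasis on diagnosing specific failure modes such as retrieval robustness and hallucination control.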