

DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

April 16, 2026
作者: Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu
cs.AI

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials, and each task is paired with a static research sandbox corpus, containing supportive documents, distractors, and noise, that simulates open-web complexity while remaining fully verifiable. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with DR^{3}-Agent, a multi-agent system we develop on top of multiple state-of-the-art language models, demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
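The abstract names five evaluation dimensions but does not specify how per-dimension scores are combined. The sketch below is purely illustrative: the field names follow the abstract, while the 0-1 score scale, the unweighted-mean aggregation, and the `DimensionScores` container are assumptions, not the paper's definitions.

```python
from dataclasses import dataclass, astuple

@dataclass
class DimensionScores:
    """Scores for one generated report. Scales and aggregation
    are assumptions for illustration, not DR^{3}-Eval's actual
    scoring rules."""
    information_recall: float
    factual_accuracy: float
    citation_coverage: float
    instruction_following: float
    depth_quality: float

    def overall(self) -> float:
        # Unweighted mean over the five dimensions (an assumption;
        # the benchmark may weight or report dimensions separately).
        vals = astuple(self)
        return sum(vals) / len(vals)

scores = DimensionScores(
    information_recall=0.8,
    factual_accuracy=0.9,
    citation_coverage=0.7,
    instruction_following=1.0,
    depth_quality=0.6,
)
print(round(scores.overall(), 2))  # 0.8
```

Keeping the dimensions as separate fields, rather than collapsing them early, matches the abstract's emphasis on diagnosing specific failure modes such as retrieval robustness and hallucination control.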