DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus. Each corpus contains supportive documents, distractors, and noise, so it simulates open-web complexity while remaining fully verifiable. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with DR^{3}-Agent, our multi-agent system built on multiple state-of-the-art language models, demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
