DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet evaluating them remains difficult because web environments are dynamic and task definitions are often ambiguous. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials, and each task is paired with a static research sandbox corpus containing supportive documents, distractors, and noise, which simulates the complexity of the open web while remaining fully verifiable. We also introduce a multi-dimensional evaluation framework that measures Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and we validate its alignment with human judgments. Experiments with DR^{3}-Agent, a multi-agent system we developed on top of multiple state-of-the-art language models, show that DR^{3}-Eval is highly challenging and reveal critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
