DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Abstract

Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multi-step web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports, compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria for assessing the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by measuring its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.