DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Abstract

The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these capabilities through generative research synthesis: performing retrieval over the live web and synthesizing the discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge. Existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work section of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
