DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Abstract

The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work section of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than that of every other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
