DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
August 27, 2025
Authors: Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
cs.AI
Abstract
The ability to research and synthesize knowledge is central to human
expertise and progress. An emerging class of systems promises these exciting
capabilities through generative research synthesis, performing retrieval over
the live web and synthesizing discovered sources into long-form, cited
summaries. However, evaluating such systems remains an open challenge: existing
question-answering benchmarks focus on short-form factual responses, while
expert-curated datasets risk staleness and data contamination. Both fail to
capture the complexity and evolving nature of real research synthesis tasks. In
this work, we introduce DeepScholar-bench, a live benchmark and holistic,
automated evaluation framework designed to evaluate generative research
synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv
papers and focuses on a real research synthesis task: generating the related
work sections of a paper by retrieving, synthesizing, and citing prior
research. Our evaluation framework holistically assesses performance across
three key dimensions: knowledge synthesis, retrieval quality, and
verifiability. We also develop DeepScholar-base, a reference pipeline
implemented efficiently using the LOTUS API. Using the DeepScholar-bench
framework, we perform a systematic evaluation of prior open-source systems,
search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that
DeepScholar-base establishes a strong baseline, attaining performance that is
competitive with or better than each of the other methods. We also find that
DeepScholar-bench remains
far from saturated, with no system exceeding a score of 19% across all
metrics. These results underscore the difficulty of DeepScholar-bench, as well
as its importance for progress towards AI systems capable of generative
research synthesis. We make our code available at
https://github.com/guestrin-lab/deepscholar-bench.
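
To make the LOTUS-based design mentioned above concrete, here is a minimal sketch of how a retrieve, filter, rank, and synthesize pipeline can be composed with LOTUS semantic operators over pandas DataFrames. This is not the DeepScholar-base implementation: the model choice, the prompts, and the fetch_candidates() helper are illustrative assumptions, and the actual pipeline in the linked repository should be treated as authoritative.

```python
# A minimal, hypothetical sketch of a retrieve -> filter -> rank -> synthesize
# pipeline using LOTUS semantic operators (pip install lotus-ai).
# NOT the DeepScholar-base implementation: the model name, prompts, and
# fetch_candidates() helper are illustrative assumptions only.
import pandas as pd
import lotus
from lotus.models import LM

# Configure the language model LOTUS uses to execute semantic operators.
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))


def fetch_candidates(query: str) -> pd.DataFrame:
    """Placeholder for the live retrieval step (e.g., an ArXiv/web search).

    A real system would query the live web here; dummy rows keep the
    sketch self-contained and runnable.
    """
    return pd.DataFrame(
        {
            "title": ["Candidate Paper A", "Candidate Paper B"],
            "abstract": ["First candidate abstract...", "Second candidate abstract..."],
        }
    )


query = "efficient attention mechanisms for long-context transformers"
docs = fetch_candidates(query)

# sem_filter: keep only sources the LM judges relevant to the target topic.
relevant = docs.sem_filter(f"{{abstract}} is relevant prior work on: {query}")

# sem_topk: rank the surviving sources and keep the K most important ones.
top = relevant.sem_topk(
    f"Which {{abstract}} is the most important prior work on: {query}?", K=2
)

# sem_agg: synthesize the selected sources into one cited, long-form summary.
related_work = top.sem_agg(
    "Write a short related-work paragraph synthesizing each {abstract}, "
    "citing sources by {title}"
)
print(related_work["_output"][0])
```

The appeal of this operator style, which the abstract alludes to when describing the reference pipeline as "implemented efficiently," is that relevance filtering, ranking, and long-form synthesis compose as ordinary DataFrame transformations, keeping the pipeline short and amenable to batched LM calls.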