DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
August 27, 2025
Authors: Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin
cs.AI
Abstract
The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work section of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or better than each of the other methods. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
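
The abstract notes that DeepScholar-base is implemented with the LOTUS API but does not describe the pipeline itself. As a rough illustration only, the sketch below shows the general retrieve-filter-rank-synthesize shape such a pipeline could take using LOTUS's semantic operators (sem_filter, sem_topk, sem_agg). The model name, topic string, and toy DataFrame are placeholder assumptions, not details from the paper; the actual DeepScholar-base implementation is in the linked repository.

```python
# Minimal sketch of a LOTUS-style synthesis pipeline (NOT the
# DeepScholar-base implementation). All data below is placeholder.
import pandas as pd
import lotus
from lotus.models import LM

# Configure the LM that LOTUS's semantic operators will call.
# (Model choice is an assumption for this sketch.)
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

# Hypothetical candidate pool, e.g., results of an ArXiv/web search.
papers = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["...", "...", "..."],
})

topic = "evaluating generative research synthesis systems"

# Semantic filter: keep only papers relevant to the topic.
relevant = papers.sem_filter("{abstract} is relevant to " + topic)

# Semantic top-k: rank the survivors and keep the most relevant few.
top = relevant.sem_topk(
    "Which {abstract} is most relevant to " + topic + "?", K=2
)

# Semantic aggregation: synthesize a related-work style summary that
# cites each retained paper by title.
related_work = top.sem_agg(
    "Write a short related-work paragraph synthesizing the {title}s "
    "and {abstract}s, citing each paper by title."
)

# In current LOTUS releases the aggregated text lands in the
# "_output" column (an assumption of this sketch).
print(related_work._output[0])
```

In a real pipeline of this kind, the placeholder DataFrame would be populated by live retrieval, and the synthesized text would additionally need verifiable citations back to the retrieved sources, the dimension DeepScholar-bench evaluates as verifiability.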