ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
August 14, 2025
Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
cs.AI
Abstract
The advent of Deep Research agents has substantially reduced the time
required for conducting extensive research tasks. However, these tasks
inherently demand rigorous standards of factual accuracy and comprehensiveness,
necessitating thorough evaluation before widespread adoption. In this paper, we
propose ReportBench, a systematic benchmark designed to evaluate the content
quality of research reports generated by large language models (LLMs). Our
evaluation focuses on two critical dimensions: (1) the quality and relevance of
cited literature, and (2) the faithfulness and veracity of the statements
within the generated reports. ReportBench leverages high-quality published
survey papers available on arXiv as gold-standard references, from which we
apply reverse prompt engineering to derive domain-specific prompts and
establish a comprehensive evaluation corpus. Furthermore, we develop an
agent-based automated framework within ReportBench that systematically analyzes
generated reports by extracting citations and statements, checking the
faithfulness of cited content against original sources, and validating
non-cited claims using web-based resources. Empirical evaluations demonstrate
that commercial Deep Research agents such as those developed by OpenAI and
Google consistently generate more comprehensive and reliable reports than
standalone LLMs augmented with search or browsing tools. However, there remains
substantial room for improvement in terms of the breadth and depth of research
coverage, as well as factual consistency. The complete code and data will be
released at the following link: https://github.com/ByteDance-BandAI/ReportBench
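The evaluation flow the abstract describes — extract citations and statements from a generated report, then route cited content to source-faithfulness checks and non-cited claims to web-based verification — can be sketched minimally. This is an illustrative assumption, not the paper's implementation: the regex for arXiv identifiers, the naive sentence splitter, and the citation heuristic below are all placeholders for what the actual agent presumably does with an LLM.

```python
import re

def extract_citations(report: str) -> list[str]:
    """Pull arXiv IDs cited in a report (assumes an 'arXiv:NNNN.NNNNN' format)."""
    return re.findall(r"arXiv:(\d{4}\.\d{4,5})", report)

def split_statements(report: str) -> list[str]:
    """Naive sentence split on terminal punctuation; a stand-in for
    the LLM-based statement extraction the framework would use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]

def partition_statements(statements: list[str]) -> tuple[list[str], list[str]]:
    """Route each statement: cited ones go to source-faithfulness checking,
    non-cited claims go to web-based verification (both stubbed out here)."""
    cited, non_cited = [], []
    for s in statements:
        (cited if "arXiv:" in s else non_cited).append(s)
    return cited, non_cited

report = (
    "Scaling laws govern transformer performance (arXiv:2001.08361). "
    "Larger models are generally more sample-efficient."
)
ids = extract_citations(report)            # ['2001.08361']
cited, non_cited = partition_statements(split_statements(report))
```

In the full framework, `cited` statements would be compared against the text of the referenced papers, while `non_cited` claims would be checked with search or browsing tools.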