ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
August 14, 2025
Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
cs.AI
Abstract
The advent of Deep Research agents has substantially reduced the time
required for conducting extensive research tasks. However, these tasks
inherently demand rigorous standards of factual accuracy and comprehensiveness,
necessitating thorough evaluation before widespread adoption. In this paper, we
propose ReportBench, a systematic benchmark designed to evaluate the content
quality of research reports generated by large language models (LLMs). Our
evaluation focuses on two critical dimensions: (1) the quality and relevance of
cited literature, and (2) the faithfulness and veracity of the statements
within the generated reports. ReportBench leverages high-quality published
survey papers available on arXiv as gold-standard references, from which we
apply reverse prompt engineering to derive domain-specific prompts and
establish a comprehensive evaluation corpus. Furthermore, we develop an
agent-based automated framework within ReportBench that systematically analyzes
generated reports by extracting citations and statements, checking the
faithfulness of cited content against original sources, and validating
non-cited claims using web-based resources. Empirical evaluations demonstrate
that commercial Deep Research agents such as those developed by OpenAI and
Google consistently generate more comprehensive and reliable reports than
standalone LLMs augmented with search or browsing tools. However, there remains
substantial room for improvement in terms of the breadth and depth of research
coverage, as well as factual consistency. The complete code and data will be
released at the following link: https://github.com/ByteDance-BandAI/ReportBench
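The evaluation flow the abstract describes — extract citations and statements from a generated report, then route cited content to source-faithfulness checks and non-cited claims to web-based verification — can be sketched minimally. This is an illustrative assumption, not the paper's implementation: the regex for arXiv identifiers, the naive sentence splitter, and the citation heuristic below are all placeholders for what the actual agent presumably does with an LLM.

```python
import re

def extract_citations(report: str) -> list[str]:
    """Pull arXiv IDs cited in a report (assumes an 'arXiv:NNNN.NNNNN' format)."""
    return re.findall(r"arXiv:(\d{4}\.\d{4,5})", report)

def split_statements(report: str) -> list[str]:
    """Naive sentence split on terminal punctuation; a stand-in for
    the LLM-based statement extraction the framework would use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]

def partition_statements(statements: list[str]) -> tuple[list[str], list[str]]:
    """Route each statement: cited ones go to source-faithfulness checking,
    non-cited claims go to web-based verification (both stubbed out here)."""
    cited, non_cited = [], []
    for s in statements:
        (cited if "arXiv:" in s else non_cited).append(s)
    return cited, non_cited

report = (
    "Scaling laws govern transformer performance (arXiv:2001.08361). "
    "Larger models are generally more sample-efficient."
)
ids = extract_citations(report)            # ['2001.08361']
cited, non_cited = partition_statements(split_statements(report))
```

In the full framework, `cited` statements would be compared against the text of the referenced papers, while `non_cited` claims would be checked with search or browsing tools.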