

ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

August 14, 2025
作者: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
cs.AI

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
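The verification step described above — extracting statements from a generated report and separating citation-backed claims (to be checked against their sources) from uncited claims (to be validated via the web) — could be sketched as follows. This is a minimal illustration only; the function names and the bracket-style citation pattern are assumptions, not the paper's actual implementation.

```python
import re


def extract_statements(report: str) -> tuple[list[str], list[str]]:
    """Split a generated report into sentences, then separate
    sentences carrying an inline citation marker (assumed here to be
    bracketed numbers like [12]) from uncited claims.

    Cited sentences would be checked for faithfulness against their
    original sources; uncited ones validated with web-based resources.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]
    cited, uncited = [], []
    for sentence in sentences:
        if re.search(r"\[\d+\]", sentence):
            cited.append(sentence)
        else:
            uncited.append(sentence)
    return cited, uncited


def citation_coverage(report: str) -> float:
    """Fraction of statements backed by an explicit citation."""
    cited, uncited = extract_statements(report)
    total = len(cited) + len(uncited)
    return len(cited) / total if total else 0.0
```

In the full framework, each group would then feed a separate verification agent; this sketch only shows the initial partitioning step.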