ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
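To make the first evaluation dimension concrete, here is a minimal Python sketch that compares the sources cited in a generated report with the reference list of a gold-standard survey. It is an illustration only: the abstract does not specify how citations are matched or scored, so the URL-based matching, the precision/recall scores, and the names `extract_cited_urls`, `normalize_url`, and `citation_overlap` are all assumptions and are not taken from the ReportBench code.

```python
import re

# Hypothetical pattern: treats any http(s) URL in the report as a citation.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def normalize_url(url: str) -> str:
    """Crude normalization (an assumption): drop trailing punctuation and slash, lowercase."""
    return url.rstrip(".,;").rstrip("/").lower()


def extract_cited_urls(report_text: str) -> set[str]:
    """Collect the distinct, normalized URLs that appear in a generated report."""
    return {normalize_url(u) for u in URL_PATTERN.findall(report_text)}


def citation_overlap(cited: set[str], gold: set[str]) -> dict[str, float]:
    """Score cited sources against a gold reference set.

    precision: share of cited sources that also appear in the gold survey.
    recall: share of the gold survey's references that the report cited.
    """
    matched = cited & gold
    precision = len(matched) / len(cited) if cited else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall}


# Toy inputs, invented for illustration; they are not benchmark data.
report = (
    "Transformers were introduced in https://arxiv.org/abs/1706.03762. "
    "BERT (https://arxiv.org/abs/1810.04805) followed; "
    "see also https://example.com/blog/."
)
gold_references = {
    "https://arxiv.org/abs/1706.03762",
    "https://arxiv.org/abs/1810.04805",
    "https://arxiv.org/abs/2005.14165",
    "https://arxiv.org/abs/1301.3781",
}

scores = citation_overlap(extract_cited_urls(report), gold_references)
```

In this toy case two of the three cited URLs appear in the gold set, and two of the four gold references are covered. Checking statement faithfulness against the cited sources and validating non-cited claims on the web would need retrieval and model-based judgment, which this sketch does not attempt.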
