ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Abstract

The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
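The first evaluation dimension, the quality and relevance of cited literature, can be illustrated by comparing a generated report's references against the reference list of the gold-standard survey. The sketch below is a minimal illustration under stated assumptions, not the ReportBench implementation: the abstract does not specify the metrics or the matching procedure, so the precision/recall formulation, the title-based matching, and the names `normalize_title` and `citation_overlap` are all hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace.

    Assumption: references are matched by normalized title; a real
    pipeline might match on DOI, arXiv ID, or URL instead.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def citation_overlap(report_refs: list[str], gold_refs: list[str]) -> dict:
    """Compare a report's cited titles with a survey's reference list.

    precision: share of the report's citations that appear in the survey.
    recall:    share of the survey's references that the report cites.
    """
    report = {normalize_title(t) for t in report_refs}
    gold = {normalize_title(t) for t in gold_refs}
    matched = report & gold
    precision = len(matched) / len(report) if report else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall, "matched": len(matched)}


# Hypothetical inputs, for illustration only.
report_refs = [
    "Attention Is All You Need",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    "Some Unrelated Paper",
]
gold_refs = [
    "attention is all you need.",
    "BERT: Pre-Training of Deep Bidirectional Transformers",
    "Language Models are Few-Shot Learners",
    "Deep Residual Learning for Image Recognition",
]
result = citation_overlap(report_refs, gold_refs)
```

Under this formulation, a report that cites few but relevant papers scores high on precision and low on recall, which would correspond to the gap in research breadth that the abstract mentions.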

