SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. Recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey). However, their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., they score on average 21% lower than humans in content-based evaluation).
