

SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

October 3, 2025
作者: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
cs.AI

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., they score on average 21% lower than human-written surveys in content-based evaluation).
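
To make the quiz-based answerability test concrete, below is a minimal, hypothetical sketch of how such a check could be scored: a judge model tries to answer reader quiz questions using only the generated survey text, and the score is the fraction of questions it answers consistently with the reference. The names `QuizItem`, `quiz_answerability_score`, and the `judge` callable are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a quiz-based answerability check (not the authors' code).
# A judge model answers each reader quiz question using only the survey text;
# the score is the fraction of questions judged to match the reference answer.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str
    reference_answer: str


def quiz_answerability_score(
    survey_text: str,
    quiz: List[QuizItem],
    judge: Callable[[str], str],  # assumed LLM wrapper: prompt -> response text
) -> float:
    """Return the fraction of quiz questions the judge marks as answered correctly."""
    correct = 0
    for item in quiz:
        prompt = (
            "Using ONLY the survey below, answer the question, then compare your "
            "answer with the reference and reply 'YES' if they agree, else 'NO'.\n\n"
            f"Survey:\n{survey_text}\n\n"
            f"Question: {item.question}\n"
            f"Reference answer: {item.reference_answer}\n"
        )
        verdict = judge(prompt).strip().upper()
        if verdict.startswith("YES"):
            correct += 1
    return correct / len(quiz) if quiz else 0.0
```

Under this reading, a survey that lets readers answer more of their informational-need questions receives a higher answerability score, which is how the protocol stays reader-aligned rather than relying only on surface text similarity.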