

SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

October 3, 2025
作者: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
cs.AI

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., they score on average 21% lower than human-written surveys in content-based evaluation).
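
To make the quiz-based answerability test concrete, below is a minimal, hypothetical sketch of how such a check could be scored: a judge model tries to answer reader quiz questions using only the generated survey text, and the score is the fraction of questions it answers consistently with the reference. The names `QuizItem`, `quiz_answerability_score`, and the `judge` callable are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a quiz-based answerability check (not the authors' code).
# A judge model answers each reader quiz question using only the survey text;
# the score is the fraction of questions judged to match the reference answer.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QuizItem:
    question: str
    reference_answer: str


def quiz_answerability_score(
    survey_text: str,
    quiz: List[QuizItem],
    judge: Callable[[str], str],  # assumed LLM wrapper: prompt -> response text
) -> float:
    """Return the fraction of quiz questions the judge marks as answered correctly."""
    correct = 0
    for item in quiz:
        prompt = (
            "Using ONLY the survey below, answer the question, then compare your "
            "answer with the reference and reply 'YES' if they agree, else 'NO'.\n\n"
            f"Survey:\n{survey_text}\n\n"
            f"Question: {item.question}\n"
            f"Reference answer: {item.reference_answer}\n"
        )
        verdict = judge(prompt).strip().upper()
        if verdict.startswith("YES"):
            correct += 1
    return correct / len(quiz) if quiz else 0.0
```

Under this reading, a survey that lets readers answer more of their informational-need questions receives a higher answerability score, which is how the protocol stays reader-aligned rather than relying only on surface text similarity.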