SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
October 3, 2025
Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
cs.AI
Abstract
Academic survey writing, which distills vast literature into a coherent and
insightful narrative, remains a labor-intensive and intellectually demanding
task. While recent approaches, such as general DeepResearch agents and
survey-specialized methods, can generate surveys automatically (a.k.a.
LLM4Survey), their outputs often fall short of human standards, and there is
no rigorous, reader-aligned benchmark that thoroughly reveals their
deficiencies. To fill this gap, we propose SurveyBench, a fine-grained,
quiz-driven evaluation framework featuring (1) typical survey topics sourced
from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys;
(2) a multifaceted metric hierarchy that assesses outline quality (e.g.,
coverage breadth, logical coherence), content quality (e.g., synthesis
granularity, clarity of insights), and non-textual richness; and (3) a
dual-mode evaluation protocol that includes content-based and quiz-based
answerability tests, explicitly aligned with readers' informational needs.
Results show that SurveyBench effectively challenges existing LLM4Survey
approaches (e.g., their scores are on average 21% lower than human-written
surveys in content-based evaluation).
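The sketch below illustrates, under stated assumptions, how a quiz-based
answerability score of the kind described in the abstract might be computed:
each quiz item carries a few key facts a satisfactory answer should cover, and
the score is the fraction of items the generated survey can answer. The names
`QuizItem`, `is_answerable`, and `answerability_score`, as well as the
keyword-overlap judge (a crude stand-in for an LLM judge), are hypothetical
and are not SurveyBench's actual implementation.

```python
# Minimal, hypothetical sketch of a quiz-based answerability metric.
# Not the authors' implementation; the keyword-overlap judge stands in
# for whatever judge (e.g., an LLM) the benchmark actually uses.
from dataclasses import dataclass


@dataclass
class QuizItem:
    question: str
    keywords: list[str]  # key facts an adequate answer should mention


def is_answerable(survey_text: str, item: QuizItem, min_hits: int = 2) -> bool:
    """Judge one quiz item: does the survey mention enough of its key facts?"""
    text = survey_text.lower()
    hits = sum(1 for kw in item.keywords if kw.lower() in text)
    return hits >= min(min_hits, len(item.keywords))


def answerability_score(survey_text: str, quiz: list[QuizItem]) -> float:
    """Fraction of quiz questions the generated survey can answer."""
    if not quiz:
        return 0.0
    answered = sum(is_answerable(survey_text, q) for q in quiz)
    return answered / len(quiz)


# Toy usage with made-up quiz items and a one-line "survey".
quiz = [
    QuizItem("Which quality dimensions are assessed?",
             ["outline", "content", "non-textual"]),
    QuizItem("Where do the survey topics come from?",
             ["arXiv", "surveys"]),
]
generated_survey = (
    "The benchmark scores outline and content quality on topics drawn "
    "from recent arXiv papers."
)
print(f"answerability = {answerability_score(generated_survey, quiz):.2f}")
```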