SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?
October 3, 2025
Authors: Zhaojun Sun, Xuzhou Zhu, Xuanhe Zhou, Xin Tong, Shuo Wang, Jie Fu, Guoliang Li, Zhiyuan Liu, Fan Wu
cs.AI
Abstract
Academic survey writing, which distills vast literature into a coherent and
insightful narrative, remains a labor-intensive and intellectually demanding
task. While recent approaches, such as general DeepResearch agents and
survey-specialized methods, can generate surveys automatically (a.k.a.
LLM4Survey), their outputs often fall short of human standards, and there is
no rigorous, reader-aligned benchmark that thoroughly reveals their
deficiencies. To fill this gap, we propose SurveyBench, a fine-grained,
quiz-driven evaluation framework featuring (1) typical survey topics sourced
from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys;
(2) a multifaceted metric hierarchy that assesses outline quality (e.g.,
coverage breadth, logical coherence), content quality (e.g., synthesis
granularity, clarity of insights), and non-textual richness; and (3) a
dual-mode evaluation protocol that includes content-based and quiz-based
answerability tests, explicitly aligned with readers' informational needs.
Results show that SurveyBench effectively challenges existing LLM4Survey
approaches (e.g., their scores are on average 21% lower than human-written
surveys in content-based evaluation).
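The sketch below illustrates, under stated assumptions, how a quiz-based
answerability score of the kind described in the abstract might be computed:
each quiz item carries a few key facts a satisfactory answer should cover, and
the score is the fraction of items the generated survey can answer. The names
`QuizItem`, `is_answerable`, and `answerability_score`, as well as the
keyword-overlap judge (a crude stand-in for an LLM judge), are hypothetical
and are not SurveyBench's actual implementation.

```python
# Minimal, hypothetical sketch of a quiz-based answerability metric.
# Not the authors' implementation; the keyword-overlap judge stands in
# for whatever judge (e.g., an LLM) the benchmark actually uses.
from dataclasses import dataclass


@dataclass
class QuizItem:
    question: str
    keywords: list[str]  # key facts an adequate answer should mention


def is_answerable(survey_text: str, item: QuizItem, min_hits: int = 2) -> bool:
    """Judge one quiz item: does the survey mention enough of its key facts?"""
    text = survey_text.lower()
    hits = sum(1 for kw in item.keywords if kw.lower() in text)
    return hits >= min(min_hits, len(item.keywords))


def answerability_score(survey_text: str, quiz: list[QuizItem]) -> float:
    """Fraction of quiz questions the generated survey can answer."""
    if not quiz:
        return 0.0
    answered = sum(is_answerable(survey_text, q) for q in quiz)
    return answered / len(quiz)


# Toy usage with made-up quiz items and a one-line "survey".
quiz = [
    QuizItem("Which quality dimensions are assessed?",
             ["outline", "content", "non-textual"]),
    QuizItem("Where do the survey topics come from?",
             ["arXiv", "surveys"]),
]
generated_survey = (
    "The benchmark scores outline and content quality on topics drawn "
    "from recent arXiv papers."
)
print(f"answerability = {answerability_score(generated_survey, quiz):.2f}")
```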