SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards, and a rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies is still lacking. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., they score on average 21% lower than humans in content-based evaluation).

