SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Abstract

Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. Recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), but their outputs often fall short of human standards, and no rigorous, reader-aligned benchmark exists for thoroughly revealing their deficiencies. To fill this gap, we propose SurveyBench, a fine-grained, quiz-driven evaluation framework featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and 4,947 corresponding high-quality surveys; (2) a multifaceted metric hierarchy that assesses outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers' informational needs. Results show that SurveyBench effectively challenges existing LLM4Survey approaches (e.g., they score on average 21% lower than human-written surveys in content-based evaluation).
