WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Abstract

Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a benchmark designed specifically for evaluating long-form novel generation. WebNovelBench draws on a large-scale dataset of over 4,000 Chinese web novels and frames evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework covering eight narrative quality dimensions, each assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis (PCA) and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
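As a rough illustration of the aggregation step, the sketch below projects eight per-dimension scores onto the first principal component fitted on human-authored works, then reports a percentile rank against that human distribution. It is not the benchmark's implementation: the standardization, the use of only the first component, the sign orientation, the function names, and the synthetic scores are all assumptions made for illustration.

```python
import numpy as np


def fit_first_component(human_scores):
    """Fit a 1-D PCA on human scores (rows: works, columns: 8 dimensions)."""
    mean = human_scores.mean(axis=0)
    std = human_scores.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    z = (human_scores - mean) / std
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    direction = vt[0]
    # PCA directions have arbitrary sign; orient so that higher
    # dimension scores give a higher aggregate (an assumption).
    if direction.sum() < 0:
        direction = -direction
    return mean, std, direction


def aggregate(scores, mean, std, direction):
    """Project standardized scores onto the first principal direction."""
    return ((scores - mean) / std) @ direction


def percentile_rank(value, reference):
    """Percentage of reference values that are <= value."""
    return 100.0 * np.mean(reference <= value)


rng = np.random.default_rng(0)
# Synthetic placeholder data: 500 "human" works with 8 correlated scores.
quality = rng.normal(size=(500, 1))
human = 3.0 + 0.8 * quality + 0.3 * rng.normal(size=(500, 8))

mean, std, direction = fit_first_component(human)
human_agg = aggregate(human, mean, std, direction)

# A hypothetical generated story scoring slightly above the human average.
model_story = human.mean(axis=0) + 0.5
rank = percentile_rank(aggregate(model_story, mean, std, direction), human_agg)
print(f"percentile rank: {rank:.1f}")
```

Fitting the component on human works only, rather than on human and model outputs together, keeps the reference distribution fixed as new models are added; whether the benchmark does this is not stated in the abstract.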
