WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
May 20, 2025
Authors: Leon Lin, Jun Zheng, Haidong Wang
cs.AI
Abstract
Robustly evaluating the long-form storytelling capabilities of Large Language
Models (LLMs) remains a significant challenge, as existing benchmarks often
lack the necessary scale, diversity, or objective measures. To address this, we
introduce WebNovelBench, a novel benchmark specifically designed for evaluating
long-form novel generation. WebNovelBench leverages a large-scale dataset of
over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story
generation task. We propose a multi-faceted framework encompassing eight
narrative quality dimensions, assessed automatically via an LLM-as-Judge
approach. Scores are aggregated using Principal Component Analysis and mapped
to a percentile rank against human-authored works. Our experiments demonstrate
that WebNovelBench effectively differentiates between human-written
masterpieces, popular web novels, and LLM-generated content. We provide a
comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling
abilities and offering insights for future development. This benchmark provides
a scalable, replicable, and data-driven methodology for assessing and advancing
LLM-driven narrative generation.
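To make the aggregation step concrete, below is a minimal sketch of how eight per-dimension judge scores might be combined via PCA and mapped to a percentile rank against a human-authored corpus, as the abstract describes. The synthetic score matrix, the 1-10 rating scale, the single-component projection, and the sign convention are all illustrative assumptions, not the paper's released pipeline.

```python
import numpy as np
from scipy.stats import percentileofscore
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder stand-in for judge scores on the human corpus:
# rows = human-authored web novels, columns = the eight narrative
# quality dimensions rated by the LLM judge (assumed 1-10 scale).
human_scores = rng.uniform(4.0, 10.0, size=(4000, 8))

# Aggregate the eight dimensions into one composite score by
# projecting onto the first principal component of the human corpus.
pca = PCA(n_components=1).fit(human_scores)
human_composite = pca.transform(human_scores).ravel()

# PCA component signs are arbitrary; orient the axis so that a higher
# composite corresponds to higher average judge scores.
if np.corrcoef(human_composite, human_scores.mean(axis=1))[0, 1] < 0:
    pca.components_ = -pca.components_
    human_composite = -human_composite

# Score a candidate LLM-generated story: project its eight judge
# scores into the same space, then report its percentile rank within
# the human distribution. (These example scores are made up.)
llm_scores = np.array([[7.2, 6.5, 8.0, 6.9, 7.4, 6.1, 7.8, 6.7]])
llm_composite = pca.transform(llm_scores).item()
rank = percentileofscore(human_composite, llm_composite)
print(f"Percentile rank vs. human-authored novels: {rank:.1f}")
```

Fitting PCA on the human corpus alone keeps the reference distribution fixed, so every candidate story is ranked against the same human baseline regardless of which or how many LLM outputs are later evaluated.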