WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
May 20, 2025
Authors: Leon Lin, Jun Zheng, Haidong Wang
cs.AI
Abstract
Robustly evaluating the long-form storytelling capabilities of Large Language
Models (LLMs) remains a significant challenge, as existing benchmarks often
lack the necessary scale, diversity, or objective measures. To address this, we
introduce WebNovelBench, a novel benchmark specifically designed for evaluating
long-form novel generation. WebNovelBench leverages a large-scale dataset of
over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story
generation task. We propose a multi-faceted framework encompassing eight
narrative quality dimensions, assessed automatically via an LLM-as-Judge
approach. Scores are aggregated using Principal Component Analysis and mapped
to a percentile rank against human-authored works. Our experiments demonstrate
that WebNovelBench effectively differentiates between human-written
masterpieces, popular web novels, and LLM-generated content. We provide a
comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling
abilities and offering insights for future development. This benchmark provides
a scalable, replicable, and data-driven methodology for assessing and advancing
LLM-driven narrative generation.
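
As a rough illustration of the LLM-as-Judge step described above, the sketch below scores a single story on eight dimensions through an OpenAI-compatible chat API. This is not the authors' implementation: the dimension names, the judge prompt, and the `gpt-4o` judge model are hypothetical placeholders standing in for the framework the paper defines.

```python
# A minimal LLM-as-Judge sketch, assuming an OpenAI-compatible API.
# The dimension names are hypothetical stand-ins for the paper's eight
# narrative quality dimensions, which the abstract does not enumerate.
import json
from openai import OpenAI

DIMENSIONS = [
    "plot_coherence", "character_development", "language_quality",
    "creativity", "pacing", "emotional_impact", "worldbuilding",
    "synopsis_faithfulness",
]

def judge_story(client: OpenAI, story: str, model: str = "gpt-4o") -> dict:
    """Ask a judge model to rate one story on each dimension (1-10)."""
    prompt = (
        "Rate the following story from 1 to 10 on each of these "
        f"dimensions: {', '.join(DIMENSIONS)}. Reply with a JSON object "
        "mapping each dimension name to its score.\n\nStory:\n" + story
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable output
    )
    return json.loads(resp.choices[0].message.content)

# Usage (assumes OPENAI_API_KEY is set in the environment):
# scores = judge_story(OpenAI(), "Once upon a time ...")
```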
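The aggregation step the abstract mentions can likewise be sketched: project the eight per-dimension judge scores onto the first principal component and rank a model's aggregate score against the human-authored reference distribution. Again a minimal sketch under assumed inputs, using placeholder data rather than the benchmark's actual scores.

```python
# A minimal sketch of score aggregation and percentile ranking, assuming
# judge scores have already been collected. All data below is placeholder
# noise; the real benchmark draws on over 4,000 human-authored web novels.
import numpy as np
from scipy.stats import percentileofscore
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
human_scores = rng.uniform(4, 9, size=(4000, 8))  # (novels, dimensions)
model_scores = rng.uniform(3, 8, size=(50, 8))    # one LLM's generations

# Fit PCA on the human reference corpus and project both score matrices
# onto the first principal component to get a single aggregate score.
pca = PCA(n_components=1)
human_agg = pca.fit_transform(human_scores).ravel()
model_agg = pca.transform(model_scores).ravel()

# PCA components have arbitrary sign; orient the axis so that a higher
# aggregate score corresponds to higher raw judge scores.
if np.corrcoef(human_agg, human_scores.mean(axis=1))[0, 1] < 0:
    human_agg, model_agg = -human_agg, -model_agg

# Map the model's mean aggregate score to a percentile rank against the
# distribution of human-authored works.
percentile = percentileofscore(human_agg, model_agg.mean())
print(f"Model ranks at the {percentile:.1f}th percentile of human works.")
```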