WebNovelBench: Placing LLM Novelists on the Web Novel Distribution

Abstract

Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a benchmark designed specifically for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels and frames evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework covering eight narrative quality dimensions, each assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis (PCA) and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
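The aggregation step described above (PCA over the eight dimension scores, then a percentile rank against human-authored works) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of only the first principal component, the orientation rule, the function names, and the synthetic score data.

```python
import numpy as np

def fit_first_component(human_scores):
    """Fit PCA on human score vectors; return (mean, first principal axis)."""
    mean = human_scores.mean(axis=0)
    centered = human_scores - mean
    # SVD of the centered data: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    # Assumed convention: orient the axis so higher raw scores
    # yield a higher aggregate score.
    if axis.sum() < 0:
        axis = -axis
    return mean, axis

def aggregate(scores, mean, axis):
    """Project 8-dimensional score vector(s) onto the first principal axis."""
    return (scores - mean) @ axis

def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to value."""
    return 100.0 * np.mean(reference <= value)

# Synthetic stand-in for judge scores of 500 human-written novels (8 dimensions).
rng = np.random.default_rng(0)
quality = rng.normal(size=(500, 1))
human = 3.0 + quality + 0.3 * rng.normal(size=(500, 8))

mean, axis = fit_first_component(human)
human_agg = aggregate(human, mean, axis)

# Hypothetical judge scores for one LLM-generated story.
candidate = np.full(8, 3.2)
rank = percentile_rank(aggregate(candidate, mean, axis), human_agg)
print(f"candidate percentile among human works: {rank:.1f}")
```

Fitting the projection on human-authored works only, and then scoring model outputs against that fixed reference, keeps the percentile scale anchored to the human distribution regardless of which models are evaluated.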
