LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
July 1, 2025
Authors: Daniel Fein, Sebastian Russo, Violet Xiang, Kabir Jolly, Rafael Rafailov, Nick Haber
cs.AI
Abstract
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability in this context is unclear. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward-model rankings on newly generated LLM stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, the Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences on novel LLM-generated stories. We release LitBench and the reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
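
For context on item (ii): a Bradley-Terry reward model assigns a scalar score r_\theta(x, y) to a story y given its prompt x, and is fit to pairwise preference labels by minimizing the standard negative log-likelihood (this is the generic Bradley-Terry formulation; the abstract does not specify the paper's exact training recipe):

\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]

where y_w is the human-preferred story, y_l the rejected one, and \sigma the logistic sigmoid, so the model learns to score preferred stories higher. On benchmarks of this kind, accuracy is typically the fraction of held-out pairs for which r_\theta(x, y_w) > r_\theta(x, y_l), which is how figures such as the reported 78% are usually read.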