Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
March 6, 2026
Authors: Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
cs.AI
Abstract
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.