Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
March 6, 2026
Authors: Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li, Yutao Xie
cs.AI
Abstract
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
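One finding above is that consistency errors concentrate in text segments with higher token-level entropy. A minimal sketch of how per-token entropy could be estimated for a segment from the top-k log-probabilities that many LLM APIs return (the function name and input format here are illustrative assumptions, not the paper's pipeline):

```python
import math

def segment_entropy(token_logprob_dists):
    """Mean per-token Shannon entropy (in nats) over a text segment.

    token_logprob_dists: one dict per generated token, mapping each
    candidate token to its log-probability (e.g. top-k logprobs from
    an LLM API). Hypothetical input format for illustration.
    """
    entropies = []
    for dist in token_logprob_dists:
        probs = [math.exp(lp) for lp in dist.values()]
        total = sum(probs)  # renormalize over the truncated top-k support
        h = -sum((p / total) * math.log(p / total) for p in probs if p > 0)
        entropies.append(h)
    return sum(entropies) / len(entropies)
```

Ranking a story's segments by this score would surface the high-uncertainty passages where, per the finding above, contradictions are most likely to appear.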