
Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

September 3, 2024
作者: Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
cs.AI

Abstract

The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information ("needle") within large text sequences ("haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
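The evaluation protocol described above, prompting a model to generate long-form text under constraints and then checking whether the required events actually appear in the output, can be sketched as follows. This is a hypothetical illustration, not the authors' evaluation code: the function names (`check_constraints`, `constraint_score`) and the naive substring matching are our assumptions, and a real benchmark like SGT would need semantic matching rather than literal string containment.

```python
# Hypothetical sketch of an SGT-style constraint check (not the paper's code).
# Given a model's long-form output and a list of required events/constraints,
# report which constraints appear and the fraction satisfied.

def check_constraints(generated_text: str, required_events: list[str]) -> dict[str, bool]:
    """Naive case-insensitive substring check for each required event.

    A faithful evaluation would use semantic matching (e.g. an LLM judge);
    substring matching is only an illustrative stand-in.
    """
    text = generated_text.lower()
    return {event: event.lower() in text for event in required_events}

def constraint_score(generated_text: str, required_events: list[str]) -> float:
    """Fraction of required events present in the generated text."""
    if not required_events:
        return 1.0
    hits = check_constraints(generated_text, required_events)
    return sum(hits.values()) / len(required_events)

# Toy example: the model's "story" satisfies 2 of 3 required events.
story = "The detective found a golden thread at the scene. Later, she met the mayor."
events = ["golden thread", "met the mayor", "confession"]
print(constraint_score(story, events))  # 2 of 3 events present
```

Under this framing, the paper's headline finding corresponds to `constraint_score` dropping as the requested generation length grows from 16K to 32K, even for models that score well on NIAH-style retrieval.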


November 16, 2024