Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models
September 3, 2024
Authors: Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
cs.AI
Abstract
The abilities of long-context language models (LMs) are often evaluated using
the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to
assess a model's ability to identify specific information ("needle") within
large text sequences ("haystack"). While these benchmarks measure how well
models understand long-context input sequences, they do not effectively gauge
the quality of long-form text generation--a critical aspect for applications
such as design proposals and creative writing. To address this gap, we
introduce a new long-form text evaluation benchmark, Spinning the Golden
Thread (SGT), which tests models' ability to identify specific events within
generated long text sequences. In this benchmark, we prompt long-context LMs to
create long-form text that must include particular events or constraints and
evaluate their ability to incorporate these elements. We evaluated ten
long-context LMs across four distinct scenarios, three types of prompt
instructions, and two different generation-length settings (16K and 32K).
Although these models perform well on NIAH benchmarks, none demonstrated
satisfactory performance on the Spinning the Golden Thread benchmark, raising concerns
about their ability to generate coherent long-form text that follows
instructions. Additionally, as the length of the generated text increases, all
models exhibit a significant drop in performance.
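To make the evaluation idea concrete, the sketch below illustrates, under our own assumptions rather than the authors' released protocol, what an SGT-style check could look like: a model is prompted to produce long-form text that must contain a list of required events, and the output is then scored by how many of those events it actually mentions. The `generate` callable, the example events, and the naive keyword-matching scorer are all illustrative placeholders.

```python
# Illustrative sketch of an SGT-style check (not the paper's actual code):
# prompt a long-context LM to write long-form text that must include specific
# events, then measure how many required events appear in the generated output.
from typing import Callable, Iterable


def build_prompt(topic: str, required_events: Iterable[str], target_words: int) -> str:
    """Compose a generation prompt that asks for a long text containing every event."""
    bullet_list = "\n".join(f"- {event}" for event in required_events)
    return (
        f"Write a story of roughly {target_words} words about {topic}.\n"
        f"The story must explicitly include each of the following events:\n{bullet_list}\n"
    )


def score_generation(generated_text: str, required_events: Iterable[str]) -> float:
    """Fraction of required events mentioned in the output (naive substring check)."""
    text = generated_text.lower()
    events = list(required_events)
    hits = sum(1 for event in events if event.lower() in text)
    return hits / len(events) if events else 0.0


if __name__ == "__main__":
    events = ["a lighthouse keeper finds a map", "a storm destroys the harbor"]
    prompt = build_prompt("a coastal town", events, target_words=16_000)
    # `generate` stands in for any long-context LM call (API or local model).
    generate: Callable[[str], str] = lambda p: "...model output..."
    output = generate(prompt)
    print(f"Event coverage: {score_generation(output, events):.2f}")
```

In practice the paper evaluates models across multiple scenarios, instruction types, and generation lengths (16K and 32K), so a real harness would sweep those settings and use a more robust event-matching check than the substring heuristic shown here.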