Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

Abstract

The abilities of long-context language models (LMs) are often evaluated with the "Needle-in-a-Haystack" (NIAH) test, a family of tasks that assess a model's ability to locate specific information (the "needle") within a large text sequence (the "haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluated ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrated satisfactory performance on SGT, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
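
As a rough illustration of the evaluation setup summarized above, the following Python sketch enumerates the settings and scores one generated text. It is a sketch under stated assumptions, not the benchmark's actual implementation: the abstract gives only counts, so the scenario and instruction-type labels are placeholders, the fully crossed design is assumed, and `completion_rate` (simple case-insensitive substring matching) is a hypothetical stand-in for however the benchmark checks that required events appear.

```python
from itertools import product

# Placeholder labels: the abstract states only how many of each exist.
SCENARIOS = [f"scenario_{i}" for i in range(1, 5)]                  # 4 scenarios
INSTRUCTION_TYPES = [f"instruction_type_{i}" for i in range(1, 4)]  # 3 prompt instruction types
GENERATION_LENGTHS = ["16K", "32K"]                                 # 2 generation-length settings
NUM_MODELS = 10                                                     # 10 long-context LMs


def evaluation_grid():
    """Return every (scenario, instruction type, length) combination.

    Assumes a fully crossed design, which the abstract implies but does not state.
    """
    return list(product(SCENARIOS, INSTRUCTION_TYPES, GENERATION_LENGTHS))


def completion_rate(generated_text, required_events):
    """Fraction of required events found in the generated text.

    Hypothetical scoring rule: case-insensitive substring matching.
    """
    if not required_events:
        return 1.0
    text = generated_text.lower()
    found = sum(1 for event in required_events if event.lower() in text)
    return found / len(required_events)


grid = evaluation_grid()
print(len(grid))               # 24 settings per model
print(len(grid) * NUM_MODELS)  # 240 model-setting pairs
```

For example, `completion_rate("Week 3: a wedding. Week 9: a birthday party.", ["wedding", "birthday party", "funeral"])` returns two thirds, because two of the three required events appear in the text.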
