EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning the terminal prompt positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent from chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint on the attention function, and that shared Transformer parameters further suppress unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models to extend the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
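The two-segment position assignment described above can be illustrated with a minimal sketch. It is not taken from the EndPrompt repository: the function names, the 32-token terminal prompt, and the choice to place the terminal prompt on exactly the last indices of the target window are assumptions made for illustration, since the abstract only states that the indices lie "near the target context length".

```python
from typing import List, Tuple


def build_position_ids(context_len: int, terminal_len: int, target_len: int) -> List[int]:
    """Return position indices for a short two-segment training sequence.

    Hypothetical helper: the first segment keeps its original indices
    0..context_len-1, and the terminal prompt is assumed to occupy the
    last `terminal_len` indices of the target window.
    """
    if context_len + terminal_len > target_len:
        raise ValueError("segments do not fit inside the target window")
    first_segment = list(range(context_len))
    terminal_segment = list(range(target_len - terminal_len, target_len))
    return first_segment + terminal_segment


def cross_segment_distance_range(position_ids: List[int], context_len: int) -> Tuple[int, int]:
    """Return (min, max) relative distance from terminal tokens back to context tokens."""
    first_segment = position_ids[:context_len]
    terminal_segment = position_ids[context_len:]
    distances = [q - k for q in terminal_segment for k in first_segment]
    return min(distances), max(distances)


if __name__ == "__main__":
    # Assumed sizes: 8K = 8192 and 64K = 65536 tokens, with a 32-token terminal prompt.
    ids = build_position_ids(context_len=8192, terminal_len=32, target_len=65536)
    print("physical sequence length:", len(ids))
    print("cross-segment distance range:", cross_segment_distance_range(ids, 8192))
```

Under these assumed sizes the physical sequence has 8,224 tokens, yet the terminal tokens attend to the first segment at relative distances between 57,313 and 65,535. Within each segment the distances remain local, which is the mix of local and long-range distances that the abstract describes.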