EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Abstract

Extending the context window of large language models typically requires training on sequences at the target length, which incurs quadratic memory and compute costs and makes long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent from chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint on the attention function, and that shared Transformer parameters further suppress unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models to extend the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
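The two-segment construction can be illustrated with a minimal sketch of the position indices. This is not the released implementation: the helper names `endprompt_position_ids` and `relative_distances` are hypothetical, and placing the terminal prompt at exactly the last positions of the target window is an assumption (the abstract says only that its indices lie near the target context length). Toy sizes stand in for the 8K and 64K lengths.

```python
def endprompt_position_ids(context_len, prompt_len, target_len):
    """Build position indices for a short context plus a terminal prompt.

    Assumption (not confirmed by the abstract): the first segment keeps
    positions 0..context_len-1, and the terminal prompt occupies the last
    prompt_len positions of the target window.
    """
    if context_len + prompt_len > target_len:
        raise ValueError("segments do not fit inside the target window")
    first_segment = list(range(context_len))
    terminal_prompt = list(range(target_len - prompt_len, target_len))
    return first_segment + terminal_prompt


def relative_distances(position_ids):
    """Return the set of query-minus-key distances seen under causal attention."""
    return {
        query_pos - key_pos
        for i, query_pos in enumerate(position_ids)
        for key_pos in position_ids[: i + 1]
    }


# Toy example: 8-token context, 2-token terminal prompt, target window of 64.
ids = endprompt_position_ids(context_len=8, prompt_len=2, target_len=64)
dists = relative_distances(ids)
print(len(ids))       # physical sequence length: 10 tokens
print(sorted(dists))  # local distances 0..7 and long-range distances 55..63
```

In this toy setting the physical sequence has only 10 tokens, yet the model is exposed to relative distances up to 63. Distances 8 through 54 are never observed, which is the gap the abstract's smoothness argument is meant to cover.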