EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning the latter positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent from chunk-based simulation approaches, which split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint on the attention function and that shared Transformer parameters further suppress unstable extrapolation to unobserved intermediate distances. When applied to LLaMA-family models to extend the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
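
To make the two-segment construction concrete, the following is a minimal sketch of how position indices for such a short training sequence might be assigned. It is an illustration under stated assumptions, not the released implementation: the function names `end_prompt_position_ids` and `causal_relative_distances` are hypothetical, and placing the terminal prompt at the very last positions of the target window is one possible reading of "near the target context length".

```python
def end_prompt_position_ids(context_len, prompt_len, target_len):
    """Return position indices for a two-segment training sequence.

    Hypothetical helper: the first segment keeps its natural positions
    0..context_len-1; the terminal prompt is (by assumption) placed at the
    last prompt_len positions of the target window.
    """
    if context_len <= 0 or prompt_len <= 0:
        raise ValueError("both segments must be non-empty")
    if context_len + prompt_len > target_len:
        raise ValueError("segments must fit inside the target window")
    first_segment = list(range(context_len))
    # Assumption: anchor the terminal prompt flush with the end of the window.
    second_segment = list(range(target_len - prompt_len, target_len))
    return first_segment + second_segment


def causal_relative_distances(position_ids):
    """Sorted set of distances (query position minus key position) that a
    causal attention layer sees for the given position indices."""
    seen = set()
    for i, query_pos in enumerate(position_ids):
        for key_pos in position_ids[: i + 1]:  # keys at or before the query
            seen.add(query_pos - key_pos)
    return sorted(seen)


if __name__ == "__main__":
    # Toy sizes chosen for readability, not taken from the paper.
    ids = end_prompt_position_ids(context_len=8, prompt_len=2, target_len=64)
    print(ids)
    print(causal_relative_distances(ids))
```

With these toy sizes the physical sequence has only 10 tokens, yet the observed relative distances cover 0 to 7 (local) and 55 to 63 (long-range). The intermediate distances 8 to 54 are never seen during training, which is the gap the smoothness argument in the abstract is meant to address.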