EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

Abstract

Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text, a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models to extend the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at https://github.com/clx1415926/EndPrompt.
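The two-segment construction can be illustrated with a minimal sketch of the position indices alone. This is not the authors' implementation: the function names, the 32-token prompt length, and the choice to place the prompt at exactly the last positions of the target window are all assumptions made for illustration (the abstract only says the indices are "near the target context length").

```python
def endprompt_position_ids(context_len, prompt_len, target_len):
    """Build position ids for a two-segment training sequence (illustrative only)."""
    if context_len <= 0 or prompt_len <= 0:
        raise ValueError("both segments must be non-empty")
    if context_len + prompt_len > target_len:
        raise ValueError("segments must fit inside the target window")
    # Segment 1: the original short context keeps its ordinary positions.
    context_ids = list(range(context_len))
    # Segment 2: the terminal prompt is assigned the last positions of the
    # target window (assumed placement; the source says only "near" the target).
    prompt_ids = list(range(target_len - prompt_len, target_len))
    return context_ids + prompt_ids


def relative_distance_range(position_ids, context_len):
    """Shortest and longest distance from a prompt token back to a context token."""
    context_ids = position_ids[:context_len]
    prompt_ids = position_ids[context_len:]
    shortest = min(prompt_ids) - max(context_ids)
    longest = max(prompt_ids) - min(context_ids)
    return shortest, longest


# Hypothetical setting: 8K context, 32-token terminal prompt, 64K target window.
ids = endprompt_position_ids(context_len=8192, prompt_len=32, target_len=65536)
print(len(ids))                            # 8224 physical tokens
print(relative_distance_range(ids, 8192))  # (57313, 65535)
```

Under these assumed numbers the physical sequence is only 8,224 tokens long, yet attention from the prompt back to the context spans relative distances up to 65,535, while distances inside each segment remain local. In frameworks that accept explicit position indices, a list like this would be supplied in place of the default consecutive positions.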