Resonance RoPE: Improving Context Length Generalization of Large Language Models

February 29, 2024
Authors: Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, Bang Liu
cs.AI

Abstract

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
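To make the TSTL failure mode concrete, the following minimal NumPy sketch computes RoPE's per-feature rotation angles and checks which feature pairs encounter phases at extrapolated positions that were never seen during training (the OOD positions the abstract refers to). The `resonance=True` branch illustrates one reading of the "refining" step: snapping each feature's wavelength to the nearest integer so that its phase pattern repeats exactly within the training window. This rounding step, along with the function name, sequence lengths, and tolerance, is an illustrative assumption, not code from the paper.

```python
import numpy as np

def rope_angles(num_positions, dim, base=10000.0, resonance=False):
    """Rotation angle of each RoPE feature pair at each integer position.

    With resonance=True, each feature's wavelength (2*pi / theta_i) is
    rounded to the nearest integer before computing the angles, so its
    phase pattern repeats exactly at integer positions (an assumed
    illustration of the Resonance idea, not the paper's code).
    """
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # theta_i
    if resonance:
        wavelengths = 2 * np.pi / freqs                # lambda_i
        freqs = 2 * np.pi / np.round(wavelengths)      # snap lambda_i to nearest integer
    positions = np.arange(num_positions)
    return np.outer(positions, freqs)                  # shape: (num_positions, dim // 2)

# Toy TSTL setting: "train" on 512 positions, "test" on 2048.
train_len, test_len, dim = 512, 2048, 64
for use_resonance in (False, True):
    phases = np.mod(rope_angles(test_len, dim, resonance=use_resonance), 2 * np.pi)
    train_phases, test_phases = phases[:train_len], phases[train_len:]
    covered = 0
    for i in range(dim // 2):
        # Circular distance from each test-time phase to the nearest training phase.
        diff = np.abs(test_phases[:, i, None] - train_phases[None, :, i])
        diff = np.minimum(diff, 2 * np.pi - diff)
        if np.all(diff.min(axis=1) < 1e-3):
            covered += 1
    print(f"resonance={use_resonance}: {covered}/{dim // 2} feature pairs stay in-distribution")
```

In this toy run, the rounding only helps feature pairs whose (rounded) wavelength fits inside the training window; longer-wavelength features still extrapolate, which is consistent with the abstract's framing of Resonance RoPE as a complement applied on top of a RoPE scaling method such as YaRN rather than a replacement for it.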