Resonance RoPE: Improving Context Length Generalization of Large Language Models
February 29, 2024
Authors: Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, Bang Liu
cs.AI
Abstract
This paper addresses the challenge of train-short-test-long (TSTL) scenarios
in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE),
where models pre-trained on shorter sequences face difficulty with
out-of-distribution (OOD) token positions in longer sequences. We introduce
Resonance RoPE, a novel approach designed to narrow the generalization gap in
TSTL scenarios by refining the interpolation of RoPE features for OOD
positions, significantly improving the model performance without additional
online computational costs. Furthermore, we present PosGen, a new synthetic
benchmark specifically designed for fine-grained behavior analysis in TSTL
scenarios, aiming to isolate the constantly increasing difficulty of token
generation on long contexts from the challenges of recognizing new token
positions. Our experiments on synthetic tasks show that after applying
Resonance RoPE, Transformers recognize OOD positions better and more robustly.
Our extensive LLM experiments also show superior performance after applying
Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on
both upstream language modeling tasks and a variety of downstream long-text
applications.
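
To make the TSTL/OOD-position problem concrete, the sketch below illustrates how standard RoPE assigns each feature pair a rotation wavelength and why positions beyond the pre-training length produce phase values never seen during training. This is an illustrative sketch only, not the paper's Resonance RoPE implementation; the function name rope_wavelengths, the head dimension, the RoPE base, and the training length are hypothetical placeholders.

```python
import numpy as np

def rope_wavelengths(dim: int = 64, base: float = 10000.0) -> np.ndarray:
    """Wavelength (in token positions) of each RoPE feature pair."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # theta_i = base^(-2i/d)
    return 2 * np.pi / freqs                       # positions per full rotation

train_len = 4096  # hypothetical pre-training context length
wavelengths = rope_wavelengths()

# Feature pairs whose wavelength exceeds train_len never complete a full cycle
# during pre-training, so positions beyond train_len land on phase values the
# model has never observed -- the OOD token positions the abstract refers to.
ood_pairs = wavelengths > train_len
print(f"{ood_pairs.sum()} of {len(wavelengths)} RoPE feature pairs reach unseen "
      f"phases once positions exceed {train_len}")
```

RoPE scaling methods such as YaRN, and the Resonance RoPE refinement on top of it, differ in how they remap these feature frequencies when the context is extended; the sketch only shows where the generalization gap originates.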