RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
May 15, 2026
Authors: Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng
cs.AI
Abstract
We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
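To make the quantities in the abstract concrete, the following is a minimal sketch of a RoPE attention logit. It assumes the commonly used formulation in which consecutive coordinate pairs are rotated by `pos * base**(-2i/d)`. It is not the paper's proof or experimental setup: the function names, the example vectors, and the chosen distances are all hypothetical and serve only as illustration. The sketch shows that the logit depends only on the relative distance between query and key, and it counts how often a fixed key scores higher at a near position than at a range of far positions.

```python
import math

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by pos * theta_i, with theta_i = base**(-2i/d).
    Assumption: adjacent-pair layout; some implementations pair dimensions differently."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalised attention logit between a query at q_pos and a key at k_pos."""
    rq = apply_rope(q, q_pos, base)
    rk = apply_rope(k, k_pos, base)
    return sum(a * b for a, b in zip(rq, rk))

def fraction_near_preferred(q, k, q_pos, near, far_distances, base=10000.0):
    """Share of far distances at which the same key scores higher when placed
    `near` positions before the query than when placed `far` positions before it."""
    near_score = rope_score(q, k, q_pos, q_pos - near, base)
    wins = sum(
        near_score > rope_score(q, k, q_pos, q_pos - far, base)
        for far in far_distances
    )
    return wins / len(far_distances)

# Hypothetical 8-dimensional query and key, chosen arbitrarily for illustration.
q = [1.0, 0.0, 0.5, -0.5, 0.2, 0.8, -1.0, 0.3]
k = [0.3, 0.9, -0.4, 0.1, 0.7, -0.2, 0.5, 0.5]
q_pos = 100_000

# Same relative distance (7) at very different absolute positions.
print(rope_score(q, k, q_pos, q_pos - 7), rope_score(q, k, 7, 0))

# How often the near placement wins, for two choices of the base.
for base in (1e4, 1e6):
    frac = fraction_near_preferred(q, k, q_pos, 16, range(1000, 50_000, 37), base)
    print(base, frac)
```

Because each coordinate pair contributes a sinusoid in the relative distance, the logit oscillates rather than decaying monotonically. Varying `base` in this sketch changes how quickly the low-frequency pairs rotate, which is the knob the abstract describes as trading position discrimination against token discrimination. The printed fractions are specific to these arbitrary vectors and should not be read as the paper's results.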