RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

May 15, 2026
Authors: Yufeng Du, Phillip Harris, Minyang Tian, Eliu A Huerta, Srikanth Ronanki, Subendhu Rongali, Aram Galstyan, Hao Peng
cs.AI

Abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
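
As a rough illustration of the locality claim, the NumPy sketch below computes a RoPE attention score and estimates how often a nearer key position outscores a farther one. It is not the paper's proof or experimental setup. The helper names (`rope_rotate`, `rope_score`, `locality_rate`), the interleaved pairing of dimensions, the base of 10000, and the choice of an identical query and key vector are all assumptions made for this example.

```python
import numpy as np


def rope_rotate(x, pos, base=10000.0):
    """Rotate a float vector x (even length d) to position pos, RoPE-style."""
    d = x.shape[-1]
    # Assumed frequency schedule: theta_i = base^(-2i/d), one per 2-D pair.
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out


def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention logit between a query and a key under RoPE."""
    return float(rope_rotate(q, q_pos, base) @ rope_rotate(k, k_pos, base))


def locality_rate(q, k, max_offset, trials=500, base=10000.0, seed=0):
    """Fraction of sampled distance pairs where the nearer key scores higher."""
    rng = np.random.default_rng(seed)
    q_pos = max_offset  # hypothetical query position at the end of the window
    wins = 0
    for _ in range(trials):
        # Draw two distinct distances from the query with near < far.
        near = int(rng.integers(1, max_offset))
        far = int(rng.integers(near + 1, max_offset + 1))
        s_near = rope_score(q, k, q_pos, q_pos - near, base)
        s_far = rope_score(q, k, q_pos, q_pos - far, base)
        wins += s_near > s_far
    return wins / trials


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    for window in (16, 1024, 65536):
        # Using k = q so that distance 0 is the best match by construction.
        print(window, locality_rate(q, q, window))
```

Because each score depends only on the offset `q_pos - k_pos`, `rope_score(q, k, 10, 3)` and `rope_score(q, k, 110, 103)` agree up to floating-point error. Running the script prints the estimated rate for a few window sizes. Those numbers illustrate this toy setup only and are not results from the paper.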