RoPE Provably Fails to Distinguish Positions and Tokens in Long Contexts

Abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
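The locality claim above can be probed numerically. The following is a minimal NumPy sketch, not the analysis behind the stated results: it assumes the standard RoPE frequencies theta_i = base^(-2i/d), a single attention head, and an aligned query and key (k = q), the case in which the score is largest at offset 0. The function names, the dimension, the offsets, and the list of bases are illustrative choices, not values taken from this abstract. The sketch computes the RoPE attention score as a function of the query-key offset and reports how often a far offset scores at least as high as a near one.

```python
import numpy as np

def rope_angles(dim, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/dim) (standard RoPE choice)."""
    assert dim % 2 == 0, "RoPE rotates pairs of coordinates"
    return base ** (-np.arange(0, dim, 2) / dim)

def rope_rotate(x, pos, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of a 1-D vector by the angle pos * theta_i."""
    ang = pos * rope_angles(x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention logit between a rotated query and a rotated key."""
    return float(rope_rotate(q, q_pos, base) @ rope_rotate(k, k_pos, base))

def locality_failure_rate(q, k, near, far_offsets, base=10000.0):
    """Fraction of far offsets whose score is at least the score at the near offset."""
    near_score = rope_score(q, k, near, 0, base)
    far_scores = np.array([rope_score(q, k, d, 0, base) for d in far_offsets])
    return float(np.mean(far_scores >= near_score))

rng = np.random.default_rng(0)
dim = 64                      # illustrative head dimension (assumption)
q = rng.standard_normal(dim)
k = q.copy()                  # assumption: aligned query and key

# The score depends only on the offset q_pos - k_pos, not on absolute positions.
print(rope_score(q, k, 5, 2), rope_score(q, k, 105, 102))

# Probe the locality bias for a few illustrative bases and one near offset.
for base in (1e4, 1e5, 1e6):
    rate = locality_failure_rate(q, k, near=8, far_offsets=range(1000, 3000), base=base)
    print(f"base={base:g}: far offset scores >= near offset in {rate:.3f} of cases")
```

The printed rates depend on the chosen vectors, offsets, and base, so they should be read as a way to explore the behavior rather than as a reproduction of the asymptotic 0.5 failure probability stated above.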