RoPE Provably Distinguishes Neither Positions nor Tokens in Long Contexts

Abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer-based long-context language models.
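
To make the quantities in the abstract concrete, the following is a minimal sketch of a single-head RoPE attention score as a function of query and key position. It assumes the standard RoPE formulation, in which each pair of dimensions is rotated by an angle `pos * base^(-2i/d)`. The names `rope_rotate` and `rope_score`, the dimension `d = 64`, and the random query and key vectors are illustrative assumptions and are not taken from the paper's analysis or experiments.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate vector x (even dimension d) as standard RoPE would at integer position pos."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    # One frequency per 2-D pair: theta_i = base^(-2i/d), i = 0 .. d/2 - 1
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention score between a query at q_pos and a key at k_pos."""
    return float(rope_rotate(q, q_pos, base) @ rope_rotate(k, k_pos, base))

# Hypothetical query/key vectors, used only for illustration.
rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)
k = rng.standard_normal(d)

# The score depends only on the relative offset q_pos - k_pos.
s_near_start = rope_score(q, k, 100, 90)
s_far_along = rope_score(q, k, 5100, 5090)

# Score of the same key placed at increasing distances behind a fixed query position.
query_pos = 2048
distances = np.arange(0, query_pos + 1)
scores = np.array([rope_score(q, k, query_pos, query_pos - int(t)) for t in distances])
```

Plotting `scores` against `distances`, or rerunning with a different `base`, is one way to explore how the score for a fixed query and key varies with distance. The sketch only computes the score; it does not reproduce the paper's proofs or its empirical results.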