Behind RoPE: How Does Causal Mask Encode Positional Information?

Abstract

While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we find that the interaction between the causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observe this effect in modern large language models, which suggests that the causal mask should be considered a source of positional information alongside explicit positional encodings.
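As a small sanity check of the claim that the causal mask alone carries positional information, the sketch below compares a parameter-free self-attention layer with and without the mask. It is an illustration written for this summary, not the paper's proof or experimental setup: the function name `attention`, the Gaussian toy inputs, and the choice of a reversal permutation are all assumptions. It only checks order sensitivity and does not reproduce the nearby-favoring pattern described above.

```python
import numpy as np

def attention(x, causal):
    """Parameter-free self-attention: queries, keys, and values are all x.

    Hypothetical helper for illustration; there are no learned weights
    and no explicit positional encoding.
    """
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Hide keys that come after the query (strict upper triangle).
        future = np.triu(np.ones((t, t), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable row-wise softmax; masked entries become 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))   # 8 toy tokens of dimension 16
perm = np.arange(8)[::-1]          # reverse the token order

# Without a mask, permuting the input only permutes the output,
# so the layer cannot tell token order apart.
equivariant_without_mask = np.allclose(
    attention(x[perm], causal=False), attention(x, causal=False)[perm]
)

# With the causal mask, each output depends on which tokens precede it,
# so the same check fails: order now matters.
equivariant_with_mask = np.allclose(
    attention(x[perm], causal=True), attention(x, causal=True)[perm]
)

print("permutation-equivariant without mask:", equivariant_without_mask)
print("permutation-equivariant with causal mask:", equivariant_with_mask)
```

The unmasked layer treats the input as an unordered set, while the masked layer does not, which is the minimal sense in which the mask acts as a source of positional information even when no parameters or positional encodings are present.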
