Behind RoPE: How Does Causal Mask Encode Positional Information?

Abstract

While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction between the causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
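The claim that a causal mask alone can make attention scores depend on position can be illustrated with a small numerical sketch. This is a toy setup, not the paper's experiments or proof: it assumes i.i.d. Gaussian token vectors (so the input carries no positional signal), a first layer whose logits are all zero (so the mask is the only thing shaping the attention weights), and L2 normalization as a stand-in for layer normalization. The function names `causal_uniform_attention` and `second_layer_scores` are hypothetical and chosen only for this example.

```python
import numpy as np


def causal_uniform_attention(x):
    """Parameter-free causal attention with all-zero logits (an assumption).

    With zero logits, softmax over the unmasked keys is uniform, so
    position i simply averages tokens 1..i. Only the mask shapes the weights.
    """
    seq_len = x.shape[0]
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    logits = np.where(mask, 0.0, -np.inf)  # future keys are masked out
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def second_layer_scores(seq_len=16, dim=4096, seed=0):
    """Dot-product scores computed on the output of the masked first layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((seq_len, dim))  # i.i.d. tokens, no position signal
    h = causal_uniform_attention(x)
    # L2 normalization, used here as a simple stand-in for layer norm.
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    scores = h @ h.T
    lower = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    return np.where(lower, scores, np.nan)  # keep only causal (key <= query) pairs


scores = second_layer_scores()
# Scores of the last query against every key, from the oldest to the newest.
print(np.round(scores[-1], 2))
```

Under these assumptions, the score between query position i and key position j (with j ≤ i) concentrates near sqrt(j/i) as the dimension grows, so more recent keys receive higher scores even though neither the inputs nor any parameters encode position. Trained models use learned projections and real inputs, so this sketch only shows the direction of the effect, not its size.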

