

Behind RoPE: How Does Causal Mask Encode Positional Information?

September 25, 2025
作者: Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
cs.AI

Abstract

While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we find that the interaction of the causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observe this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
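The following minimal sketch (not the paper's code; all names such as causal_self_attention, seq_len, and n_trials are illustrative) shows one way to probe the abstract's central claim: stack a few parameter-free, causally masked self-attention layers (Q = K = V = X) over i.i.d. Gaussian inputs, so that any position dependence in the final layer's averaged attention weights can only come from the mask itself.

```python
# Minimal sketch: does a causal mask alone induce position-dependent attention?
# Parameter-free attention (Q = K = V = x) over i.i.d. inputs, so the mask is
# the only possible source of positional structure.
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, n_layers, n_trials = 16, 32, 3, 2000


def causal_self_attention(x):
    """Parameter-free self-attention with a causal mask; returns (output, weights)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)  # disallow future keys
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


# Average the final layer's attention weights over many random inputs.
avg_weights = np.zeros((seq_len, seq_len))
for _ in range(n_trials):
    h = rng.standard_normal((seq_len, dim))
    for _ in range(n_layers):
        h, w = causal_self_attention(h)
    avg_weights += w
avg_weights /= n_trials

# With exchangeable inputs and no mask-induced effect, query q would spread its
# weight uniformly (1 / (q + 1)) over the allowed keys; deviations from that
# baseline indicate position dependence induced by the causal mask.
q = seq_len - 1
print("uniform baseline for last query:", 1.0 / (q + 1))
print("averaged attention of last query:", np.round(avg_weights[q], 3))
```

Because the first masked layer mixes a different-length prefix into each position's hidden state, the hidden states (and hence later layers' scores) become position dependent even though the raw inputs are exchangeable; the magnitude of the effect in this toy setting is only meant to illustrate the mechanism, not to reproduce the paper's analysis.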