
Causal Attention with Lookahead Keys

September 9, 2025
Authors: Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu
cs.AI

Abstract

In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
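The abstract does not specify how CASTLE's key updates are computed, so the sketch below is only a plausible illustration of the naive, sequential view it describes: at each position t, the keys for earlier positions j are recomputed from tokens j..t (all already visible at step t), so the autoregressive property is preserved. The summary used to build each lookahead key (a mean over hidden states) and all weight names are placeholder assumptions, not the paper's method; the paper's contribution is precisely an equivalent form that avoids this explicit recomputation.

```python
# Illustrative sketch only: the lookahead-key construction (mean over
# hidden states j..t) is a placeholder assumption, not CASTLE's actual rule.
import torch

def naive_castle_attention(h, Wq, Wk, Wv):
    """Naive sequential view of causal attention with lookahead keys.
    h: (T, d) token hidden states; Wq, Wk, Wv: (d, d) projections."""
    T, d = h.shape
    outputs = []
    for t in range(T):                      # current (query) position
        q = h[t] @ Wq                       # query uses only token t
        keys, values = [], []
        for j in range(t + 1):              # earlier positions j <= t
            # Lookahead key for position j at step t: it may integrate
            # tokens j..t, all of which are visible at step t, so the
            # autoregressive property still holds.
            ctx = h[j:t + 1].mean(dim=0)    # placeholder summary of j..t
            keys.append(ctx @ Wk)
            values.append(h[j] @ Wv)        # values kept static in this sketch
        K = torch.stack(keys)               # (t+1, d)
        V = torch.stack(values)             # (t+1, d)
        attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
        outputs.append(attn @ V)
    return torch.stack(outputs)             # (T, d)

# Example usage with random inputs:
# T, d = 8, 16
# h = torch.randn(T, d)
# Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
# out = naive_castle_attention(h, Wq, Wk, Wv)
```

Written this way the mechanism costs extra work per step because every earlier key is rebuilt at every position; the abstract states that a mathematically equivalent reformulation removes the need to materialize these lookahead keys explicitly, enabling efficient parallel training.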