Causal Attention with Lookahead Keys
September 9, 2025
Authors: Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu
cs.AI
Abstract
In standard causal attention, each token's query, key, and value (QKV) are
static and encode only preceding context. We introduce CAuSal aTtention with
Lookahead kEys (CASTLE), an attention mechanism that continually updates each
token's keys as the context unfolds. We term these updated keys lookahead keys
because they belong to earlier positions yet integrate information from tokens
that appear later relative to those positions, while strictly preserving the
autoregressive property. Although the mechanism appears sequential, we derive a
mathematical equivalence that avoids explicitly materializing lookahead keys at
each position and enables efficient parallel training. On language modeling
benchmarks, CASTLE consistently outperforms standard causal attention across
model scales, reducing validation perplexity and improving performance on a
range of downstream tasks.
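
To make the contrast concrete, the following is a minimal NumPy sketch of the two settings the abstract describes. The standard causal attention function is conventional; the lookahead variant is only a hypothetical illustration: the mean-pooled prefix summary, the update projection Wu, and the per-position loop are assumptions for exposition, not CASTLE's actual update rule, and the loop mirrors the apparently sequential formulation that the paper's mathematical equivalence avoids during training.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_causal_attention(X, Wq, Wk, Wv):
    # Standard causal attention: Q, K, V are static, computed once per token,
    # and each token attends only to positions at or before its own.
    T, _ = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # mask out future positions
    scores[future] = -np.inf
    return softmax(scores, axis=-1) @ V

def naive_lookahead_key_attention(X, Wq, Wk, Wv, Wu):
    # Hypothetical, naively sequential illustration of the lookahead-key idea:
    # for the query at position t, the key of every position i <= t is refreshed
    # with a summary of the prefix x_1..x_t (tokens later than i, never later
    # than t), so the autoregressive property is preserved. Wu and the mean
    # pooling are illustrative assumptions, not the paper's update rule.
    T, _ = X.shape
    Q, V = X @ Wq, X @ Wv
    dk = Wk.shape[1]
    out = np.zeros((T, Wv.shape[1]))
    for t in range(T):
        prefix_summary = X[: t + 1].mean(axis=0)        # context visible at step t
        K_t = X[: t + 1] @ Wk + prefix_summary @ Wu     # lookahead keys for positions <= t
        scores = Q[t] @ K_t.T / np.sqrt(dk)
        out[t] = softmax(scores) @ V[: t + 1]
    return out

# Example usage with random weights.
rng = np.random.default_rng(0)
T, d, dk, dv = 6, 16, 8, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wu = (rng.standard_normal((d, m)) for m in (dk, dk, dv, dk))
y_std = standard_causal_attention(X, Wq, Wk, Wv)
y_la = naive_lookahead_key_attention(X, Wq, Wk, Wv, Wu)

The point the sketch tries to capture is the one stated in the abstract: when the query sits at position t, the keys of earlier positions i <= t may incorporate information from tokens between i and t, but never from tokens after t, so no future information leaks into the prediction at t.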