Causal Attention with Lookahead Keys
September 9, 2025
Authors: Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu
cs.AI
Abstract
In standard causal attention, each token's query, key, and value (QKV) are
static and encode only preceding context. We introduce CAuSal aTtention with
Lookahead kEys (CASTLE), an attention mechanism that continually updates each
token's keys as the context unfolds. We term these updated keys lookahead keys
because they belong to earlier positions yet integrate information from tokens
that appear later relative to those positions, while strictly preserving the
autoregressive property. Although the mechanism appears sequential, we derive a
mathematical equivalence that avoids explicitly materializing lookahead keys at
each position and enables efficient parallel training. On language modeling
benchmarks, CASTLE consistently outperforms standard causal attention across
model scales, reducing validation perplexity and improving performance on a
range of downstream tasks.
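
To make the contrast concrete, the following is a minimal NumPy sketch of the two settings the abstract describes. The standard causal attention function is conventional; the lookahead variant is only a hypothetical illustration: the mean-pooled prefix summary, the update projection Wu, and the per-position loop are assumptions for exposition, not CASTLE's actual update rule, and the loop mirrors the apparently sequential formulation that the paper's mathematical equivalence avoids during training.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_causal_attention(X, Wq, Wk, Wv):
    # Standard causal attention: Q, K, V are static, computed once per token,
    # and each token attends only to positions at or before its own.
    T, _ = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # mask out future positions
    scores[future] = -np.inf
    return softmax(scores, axis=-1) @ V

def naive_lookahead_key_attention(X, Wq, Wk, Wv, Wu):
    # Hypothetical, naively sequential illustration of the lookahead-key idea:
    # for the query at position t, the key of every position i <= t is refreshed
    # with a summary of the prefix x_1..x_t (tokens later than i, never later
    # than t), so the autoregressive property is preserved. Wu and the mean
    # pooling are illustrative assumptions, not the paper's update rule.
    T, _ = X.shape
    Q, V = X @ Wq, X @ Wv
    dk = Wk.shape[1]
    out = np.zeros((T, Wv.shape[1]))
    for t in range(T):
        prefix_summary = X[: t + 1].mean(axis=0)        # context visible at step t
        K_t = X[: t + 1] @ Wk + prefix_summary @ Wu     # lookahead keys for positions <= t
        scores = Q[t] @ K_t.T / np.sqrt(dk)
        out[t] = softmax(scores) @ V[: t + 1]
    return out

# Example usage with random weights.
rng = np.random.default_rng(0)
T, d, dk, dv = 6, 16, 8, 8
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wu = (rng.standard_normal((d, m)) for m in (dk, dk, dv, dk))
y_std = standard_causal_attention(X, Wq, Wk, Wv)
y_la = naive_lookahead_key_attention(X, Wq, Wk, Wv, Wu)

The point the sketch tries to capture is the one stated in the abstract: when the query sits at position t, the keys of earlier positions i <= t may incorporate information from tokens between i and t, but never from tokens after t, so no future information leaks into the prediction at t.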