Causal Attention with Lookahead Keys

Abstract

In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
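For reference, the baseline that the abstract compares against can be sketched as follows. This is a minimal, pure-Python illustration of standard single-head causal attention only; it is not CASTLE, whose lookahead-key update is not specified in this abstract. The function name `causal_attention`, the helper `random_vectors`, and the use of randomly sampled q/k/v vectors (instead of learned projections of token embeddings) are assumptions made for illustration. The check at the end shows the autoregressive property: outputs at earlier positions are unaffected by tokens that come later, and each key stays fixed once computed.

```python
import math
import random


def causal_attention(q, k, v):
    """Standard single-head causal attention over lists of vectors.

    q, k, v: lists of length T, each entry a list of floats.
    Position t attends only to positions 0..t.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the keys at positions 0..t only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of the values at positions 0..t.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out


def random_vectors(rng, n, d):
    """Hypothetical helper: n random d-dimensional vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
T, d = 5, 4
q, k, v = (random_vectors(rng, T, d) for _ in range(3))

full = causal_attention(q, k, v)
prefix = causal_attention(q[:3], k[:3], v[:3])

# Autoregressive property: the outputs for the first 3 positions do not
# depend on tokens 3 and 4, because the keys k[0..2] are static.
prefix_matches = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_full, row_prefix in zip(full, prefix)
    for a, b in zip(row_full, row_prefix)
)
print(prefix_matches)
```

In this baseline the key `k[s]` never changes after position `s` is processed. According to the abstract, CASTLE differs precisely here: keys of earlier positions are updated as later tokens arrive, while the output at each position still depends only on tokens up to that position.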
