Multi-Token Attention

Abstract

Soft attention is a critical mechanism that enables LLMs to locate relevant parts within a given context. However, each individual attention weight is determined by the similarity of only a single query vector and a single key vector. This "single token attention" bottlenecks the amount of information that can be used to distinguish a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys, and heads, so that nearby queries and keys can affect each other's attention weights, yielding more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA improves performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
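
To make the mechanism concrete, the following is a minimal NumPy sketch of one way the convolutions over queries, keys, and heads could be arranged. The abstract does not specify kernel shapes, where the convolutions sit relative to the softmax, or how masking is handled, so everything here is an assumption made for illustration: the function names (`key_query_conv`, `multi_token_attention`), the kernel layout, applying the key-query convolution to the pre-softmax scores, and mixing heads with a dense matrix after the softmax.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries get weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def key_query_conv(logits, kernel):
    """Mix each attention score with the scores of nearby (query, key) pairs.

    logits: (T, T) scores for one head.
    kernel: (cq, ck) weights (assumed layout): rows cover the current query
            and the cq - 1 previous ones, columns are centred on the key.
    """
    T = logits.shape[0]
    cq, ck = kernel.shape
    causal = np.tril(np.ones((T, T), dtype=bool))
    # Zero out future keys first so they cannot leak into the mixture.
    masked = np.where(causal, logits, 0.0)
    pad_k = ck // 2
    padded = np.pad(masked, ((cq - 1, 0), (pad_k, pad_k)))
    out = np.zeros_like(logits)
    for a in range(cq):
        for b in range(ck):
            # kernel[a, b] weights the score of query i - a and key j + b - pad_k.
            rows = slice(cq - 1 - a, cq - 1 - a + T)
            out += kernel[a, b] * padded[rows, b:b + T]
    # Re-apply the causal mask so the softmax ignores future keys.
    return np.where(causal, out, -np.inf)


def multi_token_attention(q, k, v, kq_kernels, head_mix):
    """q, k, v: (H, T, d); kq_kernels: (H, cq, ck); head_mix: (H, H)."""
    H, T, d = q.shape
    logits = np.einsum("htd,hsd->hts", q, k) / np.sqrt(d)
    logits = np.stack([key_query_conv(logits[h], kq_kernels[h]) for h in range(H)])
    weights = softmax(logits, axis=-1)
    # Head mixing: each output head is a weighted sum of all heads' weights.
    weights = np.einsum("gh,hts->gts", head_mix, weights)
    return np.einsum("hts,hsd->htd", weights, v), weights
```

With a kernel that is 1 at the current query and centre key and 0 elsewhere, together with an identity `head_mix`, this sketch reduces to ordinary causal softmax attention, which is a convenient sanity check. Any other kernel lets the score for a (query, key) pair depend on neighbouring pairs, which is the "multi-token" conditioning the abstract describes.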