Multi-Token Attention
April 1, 2025
Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
Soft attention is a critical mechanism powering LLMs to locate relevant parts
within a given context. However, individual attention weights are determined by
the similarity of only a single query and key token vector. This "single token
attention" bottlenecks the amount of information used in distinguishing a
relevant part from the rest of the context. To address this issue, we propose a
new attention method, Multi-Token Attention (MTA), which allows LLMs to
condition their attention weights on multiple query and key vectors
simultaneously. This is achieved by applying convolution operations over
queries, keys and heads, allowing nearby queries and keys to affect each
other's attention weights for more precise attention. As a result, our method
can locate relevant context using richer, more nuanced information that can
exceed a single vector's capacity. Through extensive evaluations, we
demonstrate that MTA achieves enhanced performance on a range of popular
benchmarks. Notably, it outperforms Transformer baseline models on standard
language modeling tasks, and on tasks that require searching for information
within long contexts, where our method's ability to leverage richer information
proves particularly beneficial.
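To make the mechanism concrete, below is a minimal PyTorch sketch of the key-query convolution idea: a small 2D kernel slides over the pre-softmax attention logits so that neighboring queries and keys influence each other's weights. The function name `mta_attention`, the per-head kernel shape, and the padding/masking details are illustrative assumptions, not the paper's exact implementation (which also includes convolution across attention heads and other variants).

```python
import torch
import torch.nn.functional as F

def mta_attention(q, k, v, conv_weight):
    """Sketch of key-query convolution over attention logits.

    q, k, v: (batch, heads, seq, dim).
    conv_weight: (heads, 1, c_q, c_k) -- one 2D kernel per head
    (a hypothetical shape; c_k is assumed odd here).
    """
    b, h, n, d = q.shape
    c_q, c_k = conv_weight.shape[-2:]

    # Standard scaled dot-product logits.
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d ** 0.5

    # Causal mask: query i may only attend to keys j <= i. Masked
    # positions are zeroed (not set to -inf) so the convolution below
    # is not polluted by infinities; a simplification of the paper's
    # masking variants.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(mask, 0.0)

    # Key-query convolution: each logit is recombined from a c_q x c_k
    # neighborhood of logits, so nearby queries and keys shape each
    # other's attention weights. Causal padding on the query axis
    # (earlier queries only), symmetric padding on the key axis.
    padded = F.pad(scores, (c_k // 2, c_k // 2, c_q - 1, 0))
    mixed = F.conv2d(padded, conv_weight, groups=h)

    # Re-apply the causal mask before normalizing.
    mixed = mixed.masked_fill(mask, float("-inf"))
    probs = torch.softmax(mixed, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)

# Toy usage: 2 sequences, 4 heads, 16 tokens, head dim 32, 3x5 kernels.
q = k = v = torch.randn(2, 4, 16, 32)
w = torch.randn(4, 1, 3, 5) * 0.1
out = mta_attention(q, k, v, w)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

Setting the kernel to 1x1 with weight 1 recovers standard single-token attention, which is one way to see how the convolution strictly generalizes it.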