Multi-Token Attention
April 1, 2025
Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
cs.AI
Abstract
Soft attention is a critical mechanism powering LLMs to locate relevant parts
within a given context. However, individual attention weights are determined by
the similarity of only a single query and key token vector. This "single token
attention" bottlenecks the amount of information used in distinguishing a
relevant part from the rest of the context. To address this issue, we propose a
new attention method, Multi-Token Attention (MTA), which allows LLMs to
condition their attention weights on multiple query and key vectors
simultaneously. This is achieved by applying convolution operations over
queries, keys and heads, allowing nearby queries and keys to affect each
other's attention weights for more precise attention. As a result, our method
can locate relevant context using richer, more nuanced information that can
exceed a single vector's capacity. Through extensive evaluations, we
demonstrate that MTA achieves enhanced performance on a range of popular
benchmarks. Notably, it outperforms Transformer baseline models on standard
language modeling tasks, and on tasks that require searching for information
within long contexts, where our method's ability to leverage richer information
proves particularly beneficial.
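To make the mechanism concrete, below is a minimal PyTorch sketch of the key-query convolution idea: a small 2D kernel slides over the pre-softmax attention logits so that neighboring queries and keys influence each other's weights. The function name `mta_attention`, the per-head kernel shape, and the padding/masking details are illustrative assumptions, not the paper's exact implementation (which also includes convolution across attention heads and other variants).

```python
import torch
import torch.nn.functional as F

def mta_attention(q, k, v, conv_weight):
    """Sketch of key-query convolution over attention logits.

    q, k, v: (batch, heads, seq, dim).
    conv_weight: (heads, 1, c_q, c_k) -- one 2D kernel per head
    (a hypothetical shape; c_k is assumed odd here).
    """
    b, h, n, d = q.shape
    c_q, c_k = conv_weight.shape[-2:]

    # Standard scaled dot-product logits.
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d ** 0.5

    # Causal mask: query i may only attend to keys j <= i. Masked
    # positions are zeroed (not set to -inf) so the convolution below
    # is not polluted by infinities; a simplification of the paper's
    # masking variants.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(mask, 0.0)

    # Key-query convolution: each logit is recombined from a c_q x c_k
    # neighborhood of logits, so nearby queries and keys shape each
    # other's attention weights. Causal padding on the query axis
    # (earlier queries only), symmetric padding on the key axis.
    padded = F.pad(scores, (c_k // 2, c_k // 2, c_q - 1, 0))
    mixed = F.conv2d(padded, conv_weight, groups=h)

    # Re-apply the causal mask before normalizing.
    mixed = mixed.masked_fill(mask, float("-inf"))
    probs = torch.softmax(mixed, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", probs, v)

# Toy usage: 2 sequences, 4 heads, 16 tokens, head dim 32, 3x5 kernels.
q = k = v = torch.randn(2, 4, 16, 32)
w = torch.randn(4, 1, 3, 5) * 0.1
out = mta_attention(q, k, v, w)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

Setting the kernel to 1x1 with weight 1 recovers standard single-token attention, which is one way to see how the convolution strictly generalizes it.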