
Multi-Token Attention

April 1, 2025
Authors: Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
cs.AI

Abstract

Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This "single token attention" bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
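To make the key-query convolution idea concrete, below is a minimal PyTorch sketch, not the paper's reference implementation: the function name `multi_token_attention_sketch`, the depthwise kernel shape, the symmetric padding, and applying the convolution to pre-softmax logits are illustrative assumptions; the head-mixing convolution mentioned in the abstract and the causal handling of the convolution itself are omitted for brevity.

```python
import math
import torch
import torch.nn.functional as F

def multi_token_attention_sketch(q, k, v, conv_weight):
    """Illustrative sketch of attention with a key-query convolution.

    q, k, v: (batch, heads, seq_len, head_dim)
    conv_weight: (heads, 1, kq, kk) depthwise kernel that convolves the
        attention logits over neighboring (query, key) positions.
    """
    b, h, n, d = q.shape

    # Standard scaled dot-product logits: one query token vs. one key token.
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) / math.sqrt(d)

    # Key-query convolution: each logit is re-estimated from a small
    # neighborhood of surrounding query and key positions, so several
    # nearby tokens jointly influence where attention is placed.
    # NOTE: symmetric padding is a simplification; a faithful causal
    # treatment of the convolution is omitted here.
    kq, kk = conv_weight.shape[-2:]
    logits = F.conv2d(
        logits,                       # treat the (query, key) axes as a 2D map per head
        conv_weight,
        padding=(kq // 2, kk // 2),
        groups=h,                     # depthwise: one kernel per attention head
    )

    # Causal mask so no query attends to future keys.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1)
    logits = logits.masked_fill(mask, float("-inf"))

    attn = torch.softmax(logits, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)


# Example usage with illustrative shapes and a hypothetical 3x3 kernel:
b, h, n, d = 2, 4, 128, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
w = torch.randn(h, 1, 3, 3) * 0.1
out = multi_token_attention_sketch(q, k, v, w)   # (2, 4, 128, 64)
```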
