Multi-Token Attention

Abstract

Soft attention is a critical mechanism that enables large language models (LLMs) to locate relevant parts of a given context. However, each attention weight is determined by the similarity of only a single query vector and a single key vector. This "single token attention" bottlenecks the amount of information used to distinguish a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys, and heads, so that nearby queries and keys can affect each other's attention weights, yielding more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector's capacity. Through extensive evaluations, we demonstrate that MTA achieves improved performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks and on tasks that require searching for information within long contexts, where our method's ability to leverage richer information proves particularly beneficial.
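
To make the contrast concrete, the following is a minimal NumPy sketch of standard single-token attention next to one possible form of key-query convolution. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify kernel sizes, whether the convolution is applied before or after the softmax, or how causal masking is handled. Here we assume the kernel mixes pre-softmax logits, looks only at past queries, and is centered over keys. The function names and the example kernel are hypothetical, and the convolution over heads mentioned in the abstract is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive exactly zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_token_attention(q, k):
    """Causal attention: each weight depends on one (query, key) pair."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    return softmax(np.where(mask, logits, -np.inf))


def key_query_conv_attention(q, k, kernel):
    """Mix the logits of nearby (query, key) pairs with a 2D kernel.

    Assumption: kernel[a, b] weights the logit of query i - a and key
    j + b - ck // 2 when computing the weight for query i and key j.
    """
    n, d = q.shape
    cq, ck = kernel.shape
    logits = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Zero out future keys so they cannot leak into the mixed logits.
    logits = np.where(mask, logits, 0.0)
    pad_k = ck // 2
    padded = np.pad(logits, ((cq - 1, 0), (pad_k, pad_k)))
    mixed = np.zeros_like(logits)
    for a in range(cq):
        for b in range(ck):
            rows = slice(cq - 1 - a, cq - 1 - a + n)
            cols = slice(b, b + n)
            mixed += kernel[a, b] * padded[rows, cols]
    return softmax(np.where(mask, mixed, -np.inf))


# Example usage with random queries and keys (6 tokens, dimension 4).
rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
# Hypothetical kernel spanning 2 queries and 3 keys.
kernel = np.array([[0.0, 1.0, 0.0],
                   [0.1, 0.2, 0.1]])
w = key_query_conv_attention(q, k, kernel)
```

With a kernel whose only nonzero entry is a 1 at the current query and center key, this sketch reduces to single-token attention, which is a convenient sanity check. Any other kernel lets the weight for a given query and key depend on neighboring query and key vectors as well.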