

Trainable Dynamic Mask Sparse Attention

August 4, 2025
Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo
cs.AI

Abstract

In large language models, the demand for modeling long contexts is constantly increasing, but the quadratic complexity of the standard self-attention mechanism often becomes a bottleneck. Although existing sparse attention mechanisms have improved efficiency, they may still suffer from static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention (DMA), which effectively exploits both content-aware and position-aware sparsity. DMA achieves this through two key innovations: first, it dynamically generates content-aware sparse masks from the value representations, enabling the model to adaptively identify and focus on critical information; second, it implements position-aware sparse attention computation that effectively skips unnecessary regions of the computation. This dual-sparsity design allows the model to retain complete information while significantly reducing the computational cost of attending to important information, achieving an excellent balance between information fidelity and computational efficiency. We verify the performance of DMA through comprehensive experiments. Comparative studies show that, under the Chinchilla scaling-law setting, DMA achieves lower perplexity than multi-head attention, sliding-window attention, multi-head latent attention, and native sparse attention. Moreover, on challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B-parameter model, DMA significantly outperforms multi-head attention on both standard benchmarks and the challenging needle-in-a-haystack task. These experimental results highlight its ability to balance model efficiency with long-context modeling capability.
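
The abstract describes DMA's dual sparsity only at a high level. As a rough, single-head sketch of the idea in PyTorch, the function below derives a content-aware key mask from the value representations and combines it with a position-aware causal sliding-window mask. The scoring projection `mask_proj`, the top-k selection, and the `window` parameter are illustrative assumptions rather than the authors' design, and the sketch still materializes the full score matrix, whereas an efficiency-oriented implementation would skip the masked regions at the kernel level.

```python
# Minimal single-head sketch of content- and position-aware sparse attention.
# Assumptions (not the paper's implementation): a learned 1-dim scoring
# projection over values, top-k key selection, and a fixed sliding window.
import torch


def dynamic_mask_attention(q, k, v, mask_proj, keep_top_k=64, window=128):
    """q, k, v: (batch, seq_len, dim) tensors for a single attention head."""
    B, T, D = q.shape
    scores = q @ k.transpose(-1, -2) * D ** -0.5                  # (B, T, T)

    # Content-aware mask: score each key position from its value vector,
    # then keep only the top-k keys per sequence (assumed selection rule).
    key_scores = mask_proj(v).squeeze(-1)                         # (B, T)
    top_idx = key_scores.topk(min(keep_top_k, T), dim=-1).indices # (B, k)
    content_mask = torch.zeros(B, T, dtype=torch.bool, device=q.device)
    content_mask.scatter_(1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))

    # Position-aware mask: causal plus a local sliding window, so distant
    # regions that were not selected by content can be skipped.
    pos = torch.arange(T, device=q.device)
    causal = pos[None, :] <= pos[:, None]                         # (T, T)
    local = (pos[:, None] - pos[None, :]) < window                # (T, T)

    allowed = causal & (local | content_mask[:, None, :])         # (B, T, T)
    attn = torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
    return attn @ v


# Toy usage with a hypothetical 1-dim scoring projection.
torch.manual_seed(0)
B, T, D = 2, 256, 64
q, k, v = (torch.randn(B, T, D) for _ in range(3))
mask_proj = torch.nn.Linear(D, 1)
out = dynamic_mask_attention(q, k, v, mask_proj)
print(out.shape)  # torch.Size([2, 256, 64])
```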