Trainable Dynamic Mask Sparse Attention

August 4, 2025
Authors: Jingze Shi, Yifan Wu, Bingheng Wu, Yiran Peng, Liangdong Wang, Guang Liu, Yuyu Luo
cs.AI

Abstract

In large language models, the demand for long-context modeling continues to grow, yet the quadratic complexity of standard self-attention often becomes the bottleneck. Existing sparse attention mechanisms improve efficiency but can still suffer from static patterns or information loss. We introduce Dynamic Mask Attention (DMA), a trainable dynamic-mask sparse attention mechanism that exploits both content-aware and position-aware sparsity. DMA achieves this through two key innovations: first, it dynamically generates content-aware sparse masks from the value representations, enabling the model to adaptively identify and focus on critical information; second, it performs position-aware sparse attention computation that effectively skips unnecessary regions of the attention map. This dual-sparsity design lets the model retain complete information while concentrating computation on the important content, achieving an excellent balance between information fidelity and computational efficiency. We validate DMA through comprehensive experiments. Comparative studies show that, under a Chinchilla scaling-law setup, DMA achieves lower perplexity than multi-head attention, sliding-window attention, multi-head latent attention, and native sparse attention. On challenging multi-query associative recall tasks, DMA likewise outperforms these methods in both accuracy and efficiency. Crucially, in evaluations of a 1.7B-parameter model, DMA significantly outperforms multi-head attention on standard benchmarks and on the demanding needle-in-a-haystack task. These results highlight DMA's ability to balance model efficiency with long-context modeling capability.
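
The abstract describes DMA only at a high level, so the following is a minimal PyTorch sketch of the dual-sparsity idea, not the authors' implementation. The value-norm importance score, the window and keep_ratio parameters, and the dense masking are all illustrative assumptions; a real implementation would skip the masked regions entirely rather than materialize the full score matrix.

    import math
    import torch
    import torch.nn.functional as F

    def dynamic_mask_attention(q, k, v, window=64, keep_ratio=0.25):
        # q, k, v: (batch, heads, seq_len, head_dim)
        b, h, n, d = q.shape
        scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # (b, h, n, n)

        # Position-aware sparsity: causal mask plus a sliding local window.
        idx = torch.arange(n, device=q.device)
        causal = idx[None, :] <= idx[:, None]                # key index <= query index
        local = (idx[:, None] - idx[None, :]) < window       # key within the recent window

        # Content-aware sparsity: score each key position from its value vector
        # (here a simple L2-norm proxy) and keep only the top fraction of keys.
        importance = v.norm(dim=-1)                          # (b, h, n)
        k_keep = max(1, int(keep_ratio * n))
        top = importance.topk(k_keep, dim=-1).indices        # (b, h, k_keep)
        content = torch.zeros(b, h, n, dtype=torch.bool, device=q.device)
        content.scatter_(-1, top, True)

        # Attend to a key if it is local OR globally important, never if it is in the future.
        keep = causal[None, None] & (local[None, None] | content[:, :, None, :])
        scores = scores.masked_fill(~keep, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                 # (b, h, n, head_dim)

    # Example usage: shapes are preserved, only the attention pattern is sparsified.
    q = k = v = torch.randn(2, 8, 512, 64)
    out = dynamic_mask_attention(q, k, v)                    # -> (2, 8, 512, 64)

In this sketch the computational savings are not realized, since the full attention matrix is still computed before masking; an efficient version would use a block-sparse or windowed kernel so that the skipped query-key pairs are never evaluated, which is the point of the position-aware component described above.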