Trainable Dynamic Mask Sparse Attention

Abstract

Large language models face a growing demand for long-context modeling, but the quadratic complexity of standard self-attention often becomes a bottleneck. Existing sparse attention mechanisms improve efficiency, yet they can still suffer from static patterns or information loss. We introduce Dynamic Mask Attention (DMA), a trainable dynamic mask sparse attention mechanism that exploits both content-aware and position-aware sparsity. DMA rests on two key innovations. First, it dynamically generates content-aware sparse masks from value representations, enabling the model to identify and focus on critical information adaptively. Second, it implements position-aware sparse attention computation that skips unnecessary calculation regions. This dual-sparsity design lets the model reduce computational complexity significantly while retaining the important information in full, balancing information fidelity against computational efficiency. We evaluate DMA through comprehensive experiments. In comparative studies under Chinchilla scaling law settings, DMA achieves lower perplexity than multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention. In challenging multi-query associative recall tasks, DMA also shows better performance and efficiency than these methods. Crucially, in the evaluation of a 1.7B-parameter model, DMA significantly outperforms multi-head attention on both standard benchmarks and the challenging needle-in-a-haystack task. These results highlight DMA's ability to balance model efficiency with long-context modeling.
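As a rough illustration of the dual-sparsity idea, the following Python sketch scores each key position from its value vector, keeps only the highest-scoring positions, and computes attention over those positions alone. It is a minimal sketch under stated assumptions, not the formulation used in DMA: the scoring vector `w_score`, the top-`keep` selection rule, and the causal restriction are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a non-empty list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_mask_attention(q, k, v, w_score, keep):
    """Toy single-head attention with a value-derived sparse mask.

    q, k, v : lists of n vectors, each of length d
    w_score : length-d vector (hypothetical; stands in for a learned projection)
    keep    : number of key positions each query may attend to
    """
    n, d = len(q), len(q[0])

    # Content-aware step: one importance score per position, derived from values.
    importance = [dot(v_j, w_score) for v_j in v]

    out = []
    for i in range(n):
        # Assumption: causal setting, so query i sees positions 0..i only.
        candidates = range(i + 1)
        kept = sorted(candidates, key=lambda j: importance[j], reverse=True)[:keep]

        # Sparse computation step: scores are computed only for kept positions;
        # masked-out positions are skipped rather than computed and discarded.
        scores = [dot(q[i], k[j]) / math.sqrt(d) for j in kept]
        weights = softmax(scores)

        out.append([sum(w * v[j][c] for w, j in zip(weights, kept)) for c in range(d)])
    return out
```

With `keep` equal to the sequence length, the sketch reduces to ordinary dense causal attention; smaller values of `keep` bound the per-query work by `keep` rather than by the sequence length.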
