Trainable Dynamic Mask Sparse Attention

Abstract

In large language models, the demand for long-context modeling keeps growing, but the quadratic complexity of standard self-attention often becomes a bottleneck. Existing sparse attention mechanisms improve efficiency, yet they can still suffer from static patterns or information loss. We introduce a trainable dynamic mask sparse attention mechanism, Dynamic Mask Attention (DMA), which exploits both content-aware and position-aware sparsity. DMA achieves this through two key innovations. First, it dynamically generates content-aware sparse masks from value representations, enabling the model to adaptively identify and focus on critical information. Second, it implements position-aware sparse attention computation that skips unnecessary calculation regions. This dual-sparsity design lets the model significantly reduce the computational cost of attending to important information while retaining complete information, striking a balance between information fidelity and computational efficiency. We verify DMA's performance through comprehensive experiments. Comparative studies show that, under Chinchilla scaling law settings, DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in terms of perplexity. In challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B-parameter model, DMA significantly outperforms multi-head attention on both standard benchmarks and the challenging needle-in-a-haystack task. These results highlight its ability to balance model efficiency with long-context modeling ability.
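
To make the two ideas concrete, the following is a minimal single-head sketch in plain Python, not the authors' implementation. The abstract only states that masks are derived from value representations and that masked regions are skipped. The scoring vector `w_score`, the top-`keep` selection rule, and the function name `dynamic_mask_attention` are all hypothetical choices made for illustration. Position awareness is approximated here by a simple causal constraint.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_mask_attention(Q, K, V, w_score, keep):
    """Causal attention where each query sees only `keep` selected keys.

    ASSUMPTION: `w_score` is a hypothetical learned vector that maps each
    value vector to a scalar importance score; the abstract does not say
    how the mask is actually computed from the values.
    """
    d = len(Q[0])
    # Content-aware step: one importance score per position, from V only.
    importance = [dot(v, w_score) for v in V]
    outputs, kept = [], []
    for i in range(len(Q)):
        visible = range(i + 1)  # causal constraint: positions 0..i
        chosen = sorted(visible, key=lambda j: importance[j], reverse=True)
        chosen = sorted(chosen[:keep])
        # Sparse step: logits are computed only for the chosen keys, so
        # masked positions cost nothing instead of being set to -inf.
        logits = [dot(Q[i], K[j]) / math.sqrt(d) for j in chosen]
        weights = softmax(logits)
        out = [sum(w * V[j][c] for w, j in zip(weights, chosen))
               for c in range(len(V[0]))]
        outputs.append(out)
        kept.append(chosen)
    return outputs, kept


# Toy inputs: 4 positions, head dimension 2 (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.5]]
w_score = [1.0, 1.0]  # hypothetical scoring vector

outputs, kept = dynamic_mask_attention(Q, K, V, w_score, keep=2)
print(kept)  # [[0], [0, 1], [1, 2], [1, 2]]
```

In this toy run every query computes at most two attention logits regardless of sequence length, which is the source of the savings. A practical implementation would make the scoring parameters trainable and skip masked blocks inside a fused kernel rather than in a Python loop.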