Trainable Dynamic Mask Sparse Attention

Abstract

As large language models are asked to handle ever longer contexts, the quadratic complexity of standard self-attention often becomes a bottleneck. Existing sparse attention mechanisms improve efficiency but can still suffer from static patterns or information loss. We introduce Dynamic Mask Attention (DMA), a trainable dynamic mask sparse attention mechanism that exploits both content-aware and position-aware sparsity. DMA rests on two key innovations. First, it dynamically generates content-aware sparse masks from value representations, enabling the model to adaptively identify and focus on critical information. Second, it implements position-aware sparse attention computation that skips unnecessary calculation regions. This dual-sparsity design lets the model significantly reduce computational complexity while retaining the important information, striking a balance between information fidelity and computational efficiency. We verify DMA's performance through comprehensive experiments. Comparative studies show that DMA outperforms multi-head attention, sliding window attention, multi-head latent attention, and native sparse attention in perplexity under Chinchilla Scaling Law settings. In challenging multi-query associative recall tasks, DMA also demonstrates superior performance and efficiency compared to these methods. Crucially, in the evaluation of a 1.7B-parameter model, DMA significantly outperforms multi-head attention in both standard benchmark performance and the challenging needle-in-a-haystack task. These experimental results highlight DMA's ability to balance model efficiency and long-context modeling ability effectively.
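
The abstract does not spell out how the masks are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a single attention head, a hypothetical learned vector `w_score` that turns each value vector into a scalar importance score, and a hypothetical `keep` budget that retains the top-scoring visible keys for each query. Scores are computed only for the kept keys, which stands in for skipping unnecessary calculation regions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def dynamic_mask_attention(q, k, v, w_score, keep):
    """Causal single-head attention with a content-derived top-k key mask.

    q, k, v  : lists of n vectors (lists of floats)
    w_score  : hypothetical learned vector mapping a value vector to a score
    keep     : hypothetical budget, number of keys kept per query (>= 1)
    """
    n, d = len(q), len(q[0])
    # Content-aware step (assumed form): score each key position
    # from its value representation.
    key_score = [dot(w_score, v_j) for v_j in v]
    outputs = []
    for i in range(n):
        visible = range(i + 1)  # causal constraint: positions 0..i
        # Dynamic mask: keep only the highest-scoring visible keys.
        kept = sorted(visible, key=lambda j: key_score[j], reverse=True)[:keep]
        # Attention logits are computed for kept keys only;
        # masked positions are never touched.
        logits = [dot(q[i], k[j]) / math.sqrt(d) for j in kept]
        peak = max(logits)  # subtract the max for a stable softmax
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[j][c] for w, j in zip(weights, kept))
            for c in range(len(v[0]))
        ])
    return outputs


# Toy inputs, chosen only for illustration.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
w_score = [1.0, 1.0]

sparse = dynamic_mask_attention(q, k, v, w_score, keep=1)
dense = dynamic_mask_attention(q, k, v, w_score, keep=3)
```

With `keep=1` each query attends to a single key, so its output is simply the value vector of its highest-scoring visible position. With `keep` equal to the sequence length the function reduces to ordinary causal attention, and the per-query cost grows with `keep` rather than with the full context length.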
