HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

Abstract

This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse derives each sparse layer's token selection and KV cache directly from the preceding full attention layer. This design resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, which adds complexity and can lead to suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without reducing the size of the KV cache. HySparse lets sparse attention layers reuse the KV cache of the full attention layer, reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers use full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
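The two mechanisms described in the abstract, oracle token selection and KV cache sharing, can be illustrated with a minimal single-head NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the per-token top-k rule, the function names (`full_attention`, `select_tokens`, `sparse_attention`, `kv_cache_reduction`), and the assumption that only full attention layers store a long-context KV cache are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Causal full attention for one head; returns the output and the weights."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # keys after the query
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

def select_tokens(weights, top_k):
    """Assumed selection rule: per query, the top_k keys by full-attention weight."""
    return np.argsort(-weights, axis=-1, kind="stable")[:, :top_k]

def sparse_attention(q, k_shared, v_shared, idx):
    """Sparse layer: its own queries, but K/V reused from the full attention layer."""
    n, d = q.shape
    out = np.zeros((n, v_shared.shape[1]))
    for i in range(n):
        sel = idx[i]
        sel = sel[sel <= i]  # drop future tokens picked up by zero-weight ties
        s = q[i] @ k_shared[sel].T / np.sqrt(d)
        out[i] = softmax(s) @ v_shared[sel]
    return out

def kv_cache_reduction(total_layers, full_layers):
    """Assumes only full attention layers hold a long-context KV cache."""
    return total_layers / full_layers

rng = np.random.default_rng(0)
n, d = 8, 4
q_full, k, v = (rng.normal(size=(n, d)) for _ in range(3))
q_sparse = rng.normal(size=(n, d))  # queries of a following sparse layer

_, w = full_attention(q_full, k, v)          # full layer acts as the oracle
idx = select_tokens(w, top_k=3)              # important tokens per query
y = sparse_attention(q_sparse, k, v, idx)    # no new K/V is stored
print(y.shape, kv_cache_reduction(49, 5))
```

Under that storage assumption, the 80B MoE configuration gives 49 / 5 = 9.8, which is consistent with the "nearly 10x" KV cache reduction reported above. Setting `top_k` to the sequence length and reusing the full layer's queries makes `sparse_attention` reproduce `full_attention` exactly, which is a convenient sanity check for the sketch.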
