HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

February 3, 2026
Authors: Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, Bo Yang, Gang Wang, Shijie Cao, Fuli Luo
cs.AI

Abstract

This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
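To make the layer pattern concrete, below is a minimal, illustrative sketch of the idea described in the abstract: a full attention layer computes and caches K/V and exposes its attention scores as an "oracle", and the following sparse layers attend only to the top-scoring tokens while reusing that same KV cache. All function names, tensor shapes, and the specific importance heuristic (summing attention probabilities over queries) are assumptions for illustration, not the authors' implementation; causal masking and multi-layer stacking are omitted for brevity.

```python
# Illustrative sketch only: names, shapes, and the top-k selection rule are
# assumptions, not the HySparse reference implementation.
import torch


def full_attention(q, k, v):
    """Standard scaled dot-product attention; also returns the attention
    probabilities so they can serve as an 'oracle' for later sparse layers."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (..., T_q, T_k)
    probs = scores.softmax(dim=-1)
    return probs @ v, probs


def sparse_attention_from_oracle(q, shared_k, shared_v, oracle_probs, top_k=64):
    """Sparse attention that attends only to tokens the preceding full
    attention layer found important, reusing that layer's K/V cache."""
    # One possible importance proxy: total attention mass each key receives.
    importance = oracle_probs.sum(dim=-2)                     # (..., T_k)
    k_sel = min(top_k, importance.shape[-1])
    idx = importance.topk(k_sel, dim=-1).indices              # (..., k_sel)

    # Gather the selected keys/values from the shared (full-layer) cache.
    idx_exp = idx.unsqueeze(-1).expand(*idx.shape, shared_k.shape[-1])
    k_sub = shared_k.gather(-2, idx_exp)
    v_sub = shared_v.gather(-2, idx_exp)

    scores = q @ k_sub.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v_sub


# Toy usage: one full layer, then a sparse layer sharing its KV cache.
B, H, T, D = 1, 4, 256, 64
q = k = v = torch.randn(B, H, T, D)
full_out, probs = full_attention(q, k, v)        # full layer caches k, v, probs
sparse_out = sparse_attention_from_oracle(q, k, v, probs, top_k=32)
print(full_out.shape, sparse_out.shape)
```

Under this reading, the sparse layers add no new keys or values of their own, which is why both compute and KV cache memory shrink as the ratio of sparse to full layers grows (e.g., 44 sparse layers sharing the caches of 5 full layers in the 80B MoE configuration).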