HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing
February 3, 2026
Authors: Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, Bo Yang, Gang Wang, Shijie Cao, Fuli Luo
cs.AI
Abstract
This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse derives each sparse layer's token selection and KV cache directly from the preceding full attention layer. This design resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on auxiliary proxies to predict token importance, which adds complexity and can yield suboptimal performance; HySparse instead uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without reducing KV cache memory; HySparse lets sparse attention layers reuse the full attention layer's KV cache, cutting both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers use full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
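To make the idea concrete, here is a minimal, non-authoritative sketch of the pattern the abstract describes: a full attention layer produces a KV cache and attention weights, and a subsequent sparse layer reuses that KV cache while attending only to tokens the full layer deemed important. The function names (`full_attention`, `sparse_attention_from_oracle`), the mean-over-queries importance score, and the top-k selection are illustrative assumptions, not details from the paper; causal masking, multi-head layout, and MoE/7B specifics are omitted for brevity.

```python
import torch

def full_attention(q, k, v):
    # Standard scaled dot-product attention. Returns the output and the
    # attention weights, which later sparse layers can treat as an "oracle"
    # signal for token importance (per the HySparse description).
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = scores.softmax(dim=-1)
    return weights @ v, weights

def sparse_attention_from_oracle(q, shared_k, shared_v, oracle_weights, top_k=64):
    # Hypothetical sparse layer: reuse the KV cache (shared_k, shared_v) from
    # the preceding full attention layer, and attend only to the tokens that
    # the oracle attention weights rank as most important.
    importance = oracle_weights.mean(dim=-2)          # aggregate score per key token (assumption)
    top_k = min(top_k, shared_k.shape[-2])
    idx = importance.topk(top_k, dim=-1).indices      # indices of the selected tokens
    k_sel = shared_k.gather(-2, idx.unsqueeze(-1).expand(*idx.shape, shared_k.shape[-1]))
    v_sel = shared_v.gather(-2, idx.unsqueeze(-1).expand(*idx.shape, shared_v.shape[-1]))
    scores = q @ k_sel.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v_sel

# Toy usage: one full layer, then a sparse layer sharing its KV cache.
q = torch.randn(1, 128, 64)
k = torch.randn(1, 128, 64)
v = torch.randn(1, 128, 64)
full_out, oracle_weights = full_attention(q, k, v)
sparse_out = sparse_attention_from_oracle(q, k, v, oracle_weights, top_k=32)
```

Because the sparse layers store no KV cache of their own in this sketch, only the full attention layers contribute to cache size, which is how an interleaving ratio of roughly 1 full layer per ~10 layers could translate into the reported near-10x KV cache reduction.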