NOSA: Native and Offloadable Sparse Attention
October 15, 2025
Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
cs.AI
Abstract
Trainable sparse attention has emerged as a promising solution to address the
decoding efficiency bottleneck of LLMs in long-context processing,
significantly reducing memory accesses while minimally impacting task
performance. However, existing sparse attention methods leave a crucial
limitation unresolved: the size of the key-value (KV) cache remains unreduced,
which constrains on-GPU batch sizes and throttles decoding throughput,
especially in large-scale batched inference. In this paper, we show that
trainable sparse attention naturally exhibits strong locality in token
selection across adjacent decoding steps, thereby enabling KV cache offloading
without altering the underlying attention computation. Yet this inherent
locality alone is insufficient for efficient offloading, as the transfer
of selected KV pairs between the CPU and GPU continues to dominate the overall
decoding cost. Building on this insight, we present NOSA, a trainable sparse
attention framework designed to natively support KV cache offloading. NOSA
introduces explicit locality constraints by decomposing token selection into
query-aware and query-agnostic components, thereby reducing KV transfers while
preserving the same attention computation as used during training. We pretrain
a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that
it preserves near-lossless performance while achieving up to a 2.3x improvement
in decoding throughput compared with the vanilla trainable sparse attention
baseline (InfLLM-V2).
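
The abstract only summarizes the mechanism. As a rough, hypothetical sketch of the two ideas it names, namely token-selection locality across adjacent decoding steps and a split of the selection budget into query-agnostic and query-aware parts, the following Python snippet shows one way these could be expressed. The function names, the block-level scoring, and the 50/50 budget split are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only (not the NOSA implementation): measuring how much the
# set of selected KV blocks overlaps between adjacent decoding steps, and a toy
# query-agnostic / query-aware decomposition of block selection.
import torch


def select_topk_blocks(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the k highest-scoring KV blocks."""
    return torch.topk(scores, k).indices


def selection_locality(prev: torch.Tensor, curr: torch.Tensor) -> float:
    """Fraction of currently selected blocks that were also selected at the
    previous step. Higher locality means fewer new blocks must be fetched
    from CPU memory when the KV cache is offloaded."""
    overlap = len(set(prev.tolist()) & set(curr.tolist()))
    return overlap / max(len(curr), 1)


def decomposed_selection(query_scores: torch.Tensor,
                         static_scores: torch.Tensor,
                         k_aware: int, k_agnostic: int) -> torch.Tensor:
    """Toy query-aware / query-agnostic split: the query-agnostic part depends
    only on per-block statistics (hence is stable across steps), while the
    query-aware part refines it with the current query's scores."""
    agnostic = select_topk_blocks(static_scores, k_agnostic)
    # Exclude blocks already chosen query-agnostically, then pick the rest
    # by query-dependent scores.
    masked = query_scores.clone()
    masked[agnostic] = float("-inf")
    aware = select_topk_blocks(masked, k_aware)
    return torch.cat([agnostic, aware])


if __name__ == "__main__":
    torch.manual_seed(0)
    num_blocks, budget = 256, 32
    static_scores = torch.rand(num_blocks)  # placeholder for per-block statistics
    prev_sel = decomposed_selection(torch.rand(num_blocks), static_scores,
                                    budget // 2, budget // 2)
    curr_sel = decomposed_selection(torch.rand(num_blocks), static_scores,
                                    budget // 2, budget // 2)
    print(f"selection locality: {selection_locality(prev_sel, curr_sel):.2f}")
```

In this toy setup the query-agnostic half is identical across steps, so consecutive selections overlap by at least 50% even when the query-dependent scores change completely; that is the flavor of explicit locality constraint that keeps CPU-to-GPU KV transfers small during offloaded decoding.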