NOSA: Native and Offloadable Sparse Attention

October 15, 2025
Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
cs.AI

Abstract

Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
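As a rough illustration of the decomposition described in the abstract, the sketch below shows one plausible way a block-level selection step could combine a query-agnostic component (blocks kept resident on the GPU across decoding steps) with a query-aware component (re-scored each step under a fixed budget). All names, tensor shapes, budget values, and the block-level granularity here are assumptions made for illustration only; they are not the paper's actual selection or offloading implementation.

```python
# Minimal, hypothetical sketch of a NOSA-style selection rule (illustrative only).
# Idea: score KV blocks with a query-agnostic term plus a query-aware term, and
# reserve part of the selection budget for the query-agnostic component so that
# consecutive decoding steps reuse mostly the same blocks, reducing CPU->GPU
# KV transfers during offloaded decoding.

import torch

def select_blocks(q, block_keys, block_importance, k_total=16, k_shared=8):
    """
    q:                (d,)          current query vector
    block_keys:       (n_blocks, d) one representative key per KV block
    block_importance: (n_blocks,)   query-agnostic score (assumed precomputed)
    Returns indices of the blocks whose KV pairs should reside on the GPU.
    """
    # Query-agnostic part: unchanged across many steps, so these blocks can
    # stay resident on the GPU and need no per-step transfer.
    shared = torch.topk(block_importance, k_shared).indices

    # Query-aware part: re-scored every step, but limited to the remaining
    # budget, which bounds how many blocks may need fetching from CPU memory.
    qa_scores = block_keys @ q
    qa_scores[shared] = float("-inf")  # do not re-select the shared blocks
    dynamic = torch.topk(qa_scores, k_total - k_shared).indices

    return torch.cat([shared, dynamic])

# Toy usage: 64 KV blocks with representative keys of dimension 128.
q = torch.randn(128)
block_keys = torch.randn(64, 128)
block_importance = torch.randn(64)
resident = select_blocks(q, block_keys, block_importance)
print(resident.shape)  # torch.Size([16])
```

The point of splitting the budget is that only the query-aware portion can change between adjacent decoding steps, so the volume of newly transferred KV pairs per step is explicitly bounded, which is the locality constraint the abstract refers to.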