NOSA: Native and Offloadable Sparse Attention

Abstract

Trainable sparse attention has emerged as a promising way to address the decoding efficiency bottleneck of large language models (LLMs) in long-context processing, substantially reducing memory accesses with minimal impact on task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache is not reduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, which enables KV cache offloading without altering the underlying attention computation. This inherent locality alone, however, is insufficient for efficient offloading, because the transfer of selected KV pairs between the CPU and GPU still dominates the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput over the vanilla trainable sparse attention baseline (InfLLM-V2).
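The link between selection locality and KV transfer volume can be pictured with a small toy model. The sketch below is illustrative only and is not the NOSA algorithm: the helper names (`transfers_per_step`, `top_k`, `select_blocks`), the score values, and the assumption that the GPU keeps exactly the previous step's selected blocks resident are all hypothetical. It compares a purely query-aware top-k selection against a selection split into a fixed query-agnostic part and a smaller query-aware part, and counts how many blocks would have to be fetched from the CPU at each decoding step.

```python
from typing import AbstractSet, List, Sequence, Set


def transfers_per_step(selections: Sequence[AbstractSet[int]]) -> List[int]:
    """Count blocks fetched from CPU at each step.

    Assumption (hypothetical): the GPU holds only the blocks selected at the
    previous step, so every newly selected block costs one transfer.
    """
    fetched: List[int] = []
    resident: Set[int] = set()
    for selected in selections:
        fetched.append(len(set(selected) - resident))
        resident = set(selected)
    return fetched


def top_k(scores: Sequence[float], k: int,
          exclude: AbstractSet[int] = frozenset()) -> Set[int]:
    """Indices of the k highest scores, skipping any index in `exclude`."""
    candidates = [i for i in range(len(scores)) if i not in exclude]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return set(candidates[:k])


def select_blocks(query_scores: Sequence[float], static_scores: Sequence[float],
                  k_query: int, k_static: int) -> Set[int]:
    """Toy decomposition: a query-agnostic part plus a query-aware part."""
    agnostic = top_k(static_scores, k_static)             # independent of the query
    aware = top_k(query_scores, k_query, exclude=agnostic)  # changes with the query
    return agnostic | aware


# Made-up scores for 6 KV blocks over 2 adjacent decoding steps.
static_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]
query_scores_per_step = [
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
]

# Same budget of 3 blocks per step in both schemes.
pure = [top_k(q, 3) for q in query_scores_per_step]
decomposed = [select_blocks(q, static_scores, k_query=1, k_static=2)
              for q in query_scores_per_step]

print("query-aware only:", transfers_per_step(pure))
print("decomposed:      ", transfers_per_step(decomposed))
```

In this toy setting the query-agnostic blocks never change between steps, so after the first step at most `k_query` blocks need to be transferred, whereas a purely query-aware selection can replace its entire budget. How the paper actually scores and constrains the two components is not described in the abstract.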
