Simplified Sparse Attention via Gist Tokens

Abstract

Sparse attention can reduce the cost of long-context inference, but most variants introduce new architectural components. We introduce Simplified Sparse Attention (SSA), a simpler approach to sparse attention that requires no architectural changes. Concretely, we first perform continued pretraining on sequences interleaved with gist tokens. We optimize the standard next-token loss as usual, but the gist tokens use an attention mask that restricts which parts of the context the language model can attend to; this teaches the model to pack each chunk's important information into the gist tokens. At inference time, SSA scores chunks via attention between the current query and the small set of gist tokens, then selectively unfolds the top-k chunks by reintroducing their corresponding raw tokens. Since the query is scored only against the gist tokens, we avoid the memory-bandwidth cost of naive scoring against the full KV cache, without requiring the auxiliary KV cache approach used by other sparse attention methods. On LongBench, SSA consistently outperforms compression baselines and inference-time sparse-attention baselines under the same compression ratio. More strikingly, in retrieval-augmented generation, SSA after continued pretraining can even outperform full attention by over 5.7 points. We attribute this to SSA's selective unfolding, which concentrates attention on the query-relevant chunks and effectively filters out noise. SSA further extends to a hierarchical gist-of-gist variant (H-SSA) that achieves log-linear decoding complexity while maintaining or improving accuracy at high compression ratios of up to 32x. The code is available at https://github.com/yuzhenmao/simplified-sparse-attention/.
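
The inference-time step described above (score chunks against gist tokens, then unfold the top-k) can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_chunks` and `unfold` are hypothetical, and the sketch assumes one gist token per chunk, a single attention head, plain scaled dot-product scoring, and that each gist token stays in the context after its chunk is unfolded. None of these details are specified in the abstract.

```python
import math

def select_chunks(query, gist_keys, top_k):
    """Score chunks using only their gist keys and pick the top_k.

    Assumption: one gist key per chunk; scores are a softmax over
    scaled dot products between the query and the gist keys.
    """
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in gist_keys]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k]), scores  # selected indices in document order

def unfold(chunks, gist_tokens, selected):
    """Build the sparse context: raw tokens are reintroduced only for the
    selected chunks; every chunk keeps its gist token (an assumption)."""
    chosen = set(selected)
    context = []
    for i, (chunk, gist) in enumerate(zip(chunks, gist_tokens)):
        if i in chosen:
            context.extend(chunk)
        context.append(gist)
    return context

# Toy example with three chunks and 2-dimensional keys.
chunks = [["a", "b"], ["c", "d"], ["e", "f"]]
gist_tokens = ["<g0>", "<g1>", "<g2>"]
gist_keys = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0]]
query = [1.0, 0.0]

selected, scores = select_chunks(query, gist_keys, top_k=1)
context = unfold(chunks, gist_tokens, selected)
print(selected)  # [1]
print(context)   # ['<g0>', 'c', 'd', '<g1>', '<g2>']
```

Under these assumptions, scoring reads one key per chunk rather than one key per token, which is where the memory-bandwidth saving over scoring against the full KV cache would come from.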