IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L^2) to O(Lk). However, the indexer itself retains O(L^2) complexity and must run independently at every layer, even though the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain their indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
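The Full/Shared partition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `top_k_indices`, `source_layer`, `run_with_index_cache`, and `toy_indexer` are hypothetical, the scores are synthetic, and "nearest Full layer" is assumed here to mean the nearest preceding Full layer, so that layer 0 must be Full.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Indices = List[List[int]]  # one list of selected key positions per query


def top_k_indices(scores: Sequence[Sequence[float]], k: int) -> Indices:
    """Causal top-k: query i may only select key positions j <= i."""
    selected = []
    for i, row in enumerate(scores):
        ranked = sorted(range(i + 1), key=lambda j: row[j], reverse=True)
        selected.append(sorted(ranked[:k]))
    return selected


def source_layer(layer: int, full_layers: Sequence[int]) -> int:
    """Assumption: a Shared layer reuses the nearest preceding Full layer."""
    candidates = [f for f in full_layers if f <= layer]
    if not candidates:
        raise ValueError("layer 0 must be a Full layer")
    return max(candidates)


def run_with_index_cache(
    num_layers: int,
    full_layers: Sequence[int],
    indexer: Callable[[int], Sequence[Sequence[float]]],
    k: int,
) -> Tuple[List[Indices], int]:
    """Return the top-k indices used by each layer and the number of indexer calls."""
    cache: Dict[int, Indices] = {}
    per_layer: List[Indices] = []
    indexer_calls = 0
    for layer in range(num_layers):
        src = source_layer(layer, full_layers)
        if src == layer:
            # Full layer: run its own indexer and cache the result.
            cache[layer] = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        # Shared layer: reuse the cached indices without any indexer call.
        per_layer.append(cache[src])
    return per_layer, indexer_calls


def toy_indexer(layer: int, seq_len: int = 6) -> List[List[float]]:
    """Synthetic scores: favor recent tokens, plus a small layer-dependent bonus on token 0."""
    return [
        [-abs(i - j) + (0.1 * layer if j == 0 else 0.0) for j in range(seq_len)]
        for i in range(seq_len)
    ]


per_layer, calls = run_with_index_cache(8, [0, 4], toy_indexer, k=2)
print(calls)  # 2 indexer calls instead of 8, i.e. 75% fewer
```

Keeping one indexer for every four layers is what a 75% reduction in indexer computations corresponds to in this toy setting. Which layers to keep as Full is exactly what the training-free greedy search or the training-aware distillation would decide, and neither is modeled here.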
