MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 in DeepSeek-V3.2), all of which share the same selected token set. This multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of experts, in the style of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a small, query-dependent subset of active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring every prefix token with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks that set with the original DSA indexer, recovering the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench for both DeepSeek-V3.2 and GLM-5 while using 8× and 4× fewer indexer heads, respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer in every layer. Our TileLang kernel delivers a speedup of roughly 3.82× over DSA's original indexer kernel on a single NVIDIA H200 GPU.
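
The routing idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the router, so the mean-pooled block keys and the head-importance rule in `route_heads` are hypothetical, as are all function names and the default `block_size`. The dense score is assumed to take the form of a weighted sum of ReLU-activated query-key dot products over heads, with keys shared across heads.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_scores(q, w, keys):
    """Assumed DSA-style score per prefix token: sum_h w[h] * relu(q[h] . k_s).

    q: (H, d) indexer queries, w: (H,) head weights, keys: (L, d) shared keys.
    """
    return w @ relu(q @ keys.T)                      # shape (L,)

def top_k(scores, k):
    """Indices of the k largest scores."""
    return np.argsort(-scores, kind="stable")[:k]

def route_heads(q, w, keys, block_size, n_active):
    """Hypothetical router: rank heads using mean-pooled block keys only."""
    pooled = np.stack([keys[i:i + block_size].mean(axis=0)
                       for i in range(0, keys.shape[0], block_size)])
    # Assumed importance: magnitude of each head's contribution on pooled keys.
    importance = np.abs(w) * relu(q @ pooled.T).sum(axis=1)
    return np.sort(top_k(importance, n_active))

def misa_select(q, w, keys, k, block_size=64, n_active=8, expand=1):
    """Select k tokens using only the routed heads.

    With expand > 1, keep expand * k candidates from the routed pass and
    re-rank them with the dense scores (the hierarchical variant).
    """
    active = route_heads(q, w, keys, block_size, n_active)
    routed = w[active] @ relu(q[active] @ keys.T)    # token-level, few heads
    candidates = top_k(routed, min(keys.shape[0], expand * k))
    if expand == 1:
        return candidates
    reranked = dense_scores(q, w, keys[candidates])  # dense pass on candidates only
    return candidates[top_k(reranked, k)]

def recall(selected, reference):
    """Fraction of reference tokens that also appear in the selection."""
    return len(set(selected.tolist()) & set(reference.tolist())) / len(reference)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, d, L, k = 64, 32, 1024, 128                   # toy sizes, not from the paper
    q = rng.standard_normal((H, d))
    w = rng.standard_normal(H)
    keys = rng.standard_normal((L, d))
    ref = top_k(dense_scores(q, w, keys), k)
    print("routed recall:      ", recall(misa_select(q, w, keys, k), ref))
    print("hierarchical recall:", recall(misa_select(q, w, keys, k, expand=4), ref))
```

Random inputs carry none of the structure of a trained indexer, so the recall printed here says nothing about the 92% figure reported above; the sketch only shows where the savings come from. Counting dot products per query, the dense pass needs H·L, while the routed pass needs n_active·L plus H·⌈L/block_size⌉ for the router. With H = 64, n_active = 8, L = 131072, and an assumed block size of 64, that is 8,388,608 versus 1,179,648, about 7.1× fewer. Operation counts are not wall-clock time, so this figure is not directly comparable to the measured kernel speedup.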