
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

May 8, 2026
作者: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
cs.AI

Abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 on DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a mixture-of-experts pool. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers roughly a 3.82x speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
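The routed indexing described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch only: the shapes, the mean-pooling of keys into blocks, the max-over-blocks head-relevance score, and the sum-over-heads token score are all assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def misa_select(q, k, block_size=64, n_active_heads=8, n_selected=512):
    """Sketch of MISA-style routed token selection (details assumed).

    q: (H, d) per-head indexer queries for the current token
    k: (T, d) shared indexer keys for all prefix tokens
    Returns sorted indices of the selected prefix tokens.
    """
    H, d = q.shape
    T = k.shape[0]

    # 1) Cheap block-level statistics: mean-pool keys into coarse blocks.
    n_blocks = (T + block_size - 1) // block_size
    pooled = np.stack([
        k[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])  # (n_blocks, d)

    # 2) Router: score every head against the pooled keys only (negligible
    #    cost), then keep the few heads with the strongest block response.
    router_scores = q @ pooled.T                     # (H, n_blocks)
    head_relevance = router_scores.max(axis=1)       # (H,)
    active = np.argsort(head_relevance)[-n_active_heads:]

    # 3) Heavy token-level scoring runs only for the routed heads.
    token_scores = (q[active] @ k.T).sum(axis=0)     # (T,)
    n_selected = min(n_selected, T)
    return np.sort(np.argsort(token_scores)[-n_selected:])
```

The hierarchical variant would use the same routed pass with a larger `n_selected` to form a candidate set, then re-rank those candidates with all H heads to recover the dense indexer's selection almost exactly.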