MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
May 8, 2026
Authors: Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei
cs.AI
Abstract
DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 in DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats the indexer heads as a pool of mixture-of-experts. A lightweight router uses cheap block-level statistics to pick a query-dependent subset of only a few active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost from scoring every prefix token with every head to scoring it with only a handful of routed heads, plus a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks it with the original DSA indexer to recover the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench across DeepSeek-V3.2 and GLM-5 while running with eight and four times fewer indexer heads respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers a roughly 3.82x speedup over DSA's original indexer kernel on a single NVIDIA H200 GPU.
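To make the two-stage selection concrete, the following is a minimal NumPy sketch of the routing idea described above, not the paper's implementation: a cheap router scores a small set of pooled block-level keys with every indexer head, keeps only the strongest few heads, and lets just those heads run the expensive token-level scoring. All names, shapes, and the mean-pooling/aggregation choices are illustrative assumptions.

```python
import numpy as np

def misa_select(q, keys, pooled_keys, n_active=8, top_k_tokens=2048):
    """Hypothetical sketch of MISA-style routed token selection.

    q           : (H, d)  per-head indexer queries for the current query token
    keys        : (T, d)  token-level indexer keys for all prefix tokens
    pooled_keys : (B, d)  block-level key summaries (e.g. mean-pooled), B << T
    Returns sorted indices of the selected prefix tokens (shared by all heads).
    """
    # Stage 1 (router): score only the small pooled-key set with every head.
    # This is the "negligible router term" -- O(H * B) instead of O(H * T).
    router_scores = q @ pooled_keys.T                # (H, B)
    head_strength = router_scores.max(axis=1)        # one scalar per head
    active = np.argsort(head_strength)[-n_active:]   # query-dependent active heads

    # Stage 2: only the routed heads run the heavy token-level scoring;
    # their scores are aggregated and the top tokens are shared across heads,
    # mirroring DSA's shared selected-token set.
    token_scores = (q[active] @ keys.T).sum(axis=0)  # (T,)
    k = min(top_k_tokens, keys.shape[0])
    return np.sort(np.argsort(token_scores)[-k:])
```

The hierarchical variant described in the abstract would instead keep an enlarged candidate set from this routed pass and re-rank it with all heads of the original DSA indexer.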