MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Abstract

DeepSeek Sparse Attention (DSA) sets the state of the art for fine-grained inference-time sparse attention by introducing a learned token-wise indexer that scores every prefix token and selects the most relevant ones for the main attention. To remain expressive, the indexer uses many query heads (for example, 64 in DeepSeek-V3.2) that share the same selected token set; this multi-head design is precisely what makes the indexer the dominant cost on long contexts. We propose MISA (Mixture of Indexer Sparse Attention), a drop-in replacement for the DSA indexer that treats its indexer heads as a pool of experts, in the manner of a mixture-of-experts layer. A lightweight router uses cheap block-level statistics to pick a small, query-dependent subset of active heads, and only those heads run the heavy token-level scoring. This preserves the diversity of the original indexer pool while reducing the per-query cost: instead of scoring every prefix token with every head, MISA scores every prefix token with only a handful of routed heads and adds a negligible router term computed on a small set of pooled keys. We further introduce a hierarchical variant of MISA that uses the routed pass to keep an enlarged candidate set and then re-ranks that set with the original DSA indexer, recovering the final selected tokens almost exactly. With only eight active heads and no additional training, MISA matches the dense DSA indexer on LongBench for both DeepSeek-V3.2 and GLM-5 while running with 8× and 4× fewer indexer heads, respectively, and outperforms HISA on average. It also preserves fully green Needle-in-a-Haystack heatmaps up to a 128K-token context and recovers more than 92% of the tokens selected by the DSA indexer per layer. Our TileLang kernel delivers a speedup of roughly 3.82× over DSA's original indexer kernel on a single NVIDIA H200 GPU.
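
The routed-indexer idea can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the abstract does not specify the router, so the routing rule here (rank heads by their strongest weighted response on mean-pooled key blocks) is a hypothetical stand-in, as are all function names, the block size, and the expansion factor. The per-token score is assumed to take the DSA-style form of a weighted sum of ReLU(q_j · k) over heads, with one key vector per token shared by all heads.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def head_scores(q_heads, weights, keys, heads):
    """Assumed DSA-style index score per key: sum over `heads` of w_j * ReLU(q_j . k)."""
    return [sum(weights[j] * max(0.0, dot(q_heads[j], k)) for j in heads)
            for k in keys]

def mean_pool(keys, block):
    """Average consecutive blocks of keys to get cheap block-level summaries."""
    pooled = []
    for start in range(0, len(keys), block):
        chunk = keys[start:start + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def top_indices(values, k):
    """Indices of the k largest values, returned in ascending index order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(order[:k])

def route_heads(q_heads, weights, pooled_keys, n_active):
    """Hypothetical router: keep the heads with the strongest pooled-key response."""
    strength = [max(weights[j] * max(0.0, dot(q_heads[j], k)) for k in pooled_keys)
                for j in range(len(q_heads))]
    return top_indices(strength, n_active)

def misa_select(q_heads, weights, keys, n_active, top_k, block):
    """Route to a few heads, then score every token with only those heads."""
    active = route_heads(q_heads, weights, mean_pool(keys, block), n_active)
    scores = head_scores(q_heads, weights, keys, active)
    return active, top_indices(scores, top_k)

def hierarchical_select(q_heads, weights, keys, n_active, top_k, block, expand):
    """Keep expand * top_k routed candidates, then re-rank them with all heads."""
    _, candidates = misa_select(q_heads, weights, keys, n_active,
                                expand * top_k, block)
    dense = head_scores(q_heads, weights, [keys[i] for i in candidates],
                        range(len(q_heads)))
    return sorted(candidates[i] for i in top_indices(dense, top_k))

# Toy sizes chosen for illustration only; they are unrelated to the real models.
rng = random.Random(0)
H, D, L, K = 16, 8, 256, 32
q_heads = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(H)]
weights = [rng.random() for _ in range(H)]
keys = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(L)]

dense_sel = top_indices(head_scores(q_heads, weights, keys, range(H)), K)
active, routed_sel = misa_select(q_heads, weights, keys, n_active=4, top_k=K, block=16)
hier_sel = hierarchical_select(q_heads, weights, keys, n_active=4, top_k=K,
                               block=16, expand=2)
print("routed recall:", len(set(dense_sel) & set(routed_sel)) / K)
print("hierarchical recall:", len(set(dense_sel) & set(hier_sel)) / K)
```

In this sketch the dense pass evaluates H dot products per token, while the routed pass evaluates only `n_active` per token plus H per pooled block, which is the cost reduction the abstract describes. The recall values printed on random toy data say nothing about the recall reported for the real models.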
