MiniMax Sparse Attention
June 11, 2026
Authors: Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Pengyu Zhao
cs.AI
Abstract
Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
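The two-branch structure described in the abstract can be illustrated with a minimal pure-Python sketch of one decode step for a single GQA group. This is not the released kernel. The abstract does not say how the Index Branch computes block scores, so the scoring rule here (mean of the group's query heads dotted with mean-pooled keys of each block) is an assumption made only for illustration, and all function names (`select_blocks`, `block_sparse_attention`, and so on) are hypothetical. Top-k is taken on raw scores; since softmax is monotone, this ranking needs no exponentials.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): score every KV block, keep Top-k for this group."""
    q_summary = mean_vec(group_queries)  # assumption: one summary query per GQA group
    scores = []
    for start in range(0, len(keys), block_size):
        k_summary = mean_vec(keys[start:start + block_size])  # assumption: mean pooling
        scores.append(dot(q_summary, k_summary))
    # Rank raw scores directly; no softmax is needed to choose the Top-k blocks.
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def softmax_attention(q, keys, values):
    """Exact scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def block_sparse_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: exact attention restricted to the blocks picked by the index branch."""
    blocks = select_blocks(group_queries, keys, block_size, top_k)
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    sel_k = [keys[i] for i in idx]
    sel_v = [values[i] for i in idx]
    # All query heads in the group share the same selected KV blocks.
    outputs = [softmax_attention(q, sel_k, sel_v) for q in group_queries]
    return outputs, blocks


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_tokens, n_heads = 8, 64, 4  # toy sizes, chosen arbitrarily
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_heads)]
    out, blocks = block_sparse_attention(queries, keys, values, block_size=8, top_k=2)
    print("selected blocks:", blocks)
    print("tokens attended:", len(blocks) * 8, "of", n_tokens)
```

In this toy setup, each query head attends to `top_k * block_size` tokens instead of the full context, which is where the per-token compute saving comes from. When `top_k` equals the number of blocks, the sketch reduces to dense attention. A multi-group model would run `select_blocks` once per GQA group, so different groups can retrieve different blocks.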