MiniMax Sparse Attention

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to attend jointly over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
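
The two-branch flow described above can be illustrated with a minimal pure-Python sketch of a single decode step for one GQA group. It is not the released kernel. The function names (`select_blocks`, `sparse_attention`, `msa_decode_step`), the mean-pooled block summary, and the use of the averaged group queries as the index query are all assumptions made for illustration; the abstract does not specify how the Index Branch computes its scores. The sketch ranks blocks on raw dot-product logits, which is one plausible reading of "exp-free" selection, since softmax preserves ordering. The KV-outer loop order and tensor-core scheduling are GPU-level concerns and are not modelled here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(index_query, keys, block_size, top_k):
    """Index branch (assumed form): score each KV block, keep top_k block ids."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: a block is summarised by the mean of its keys.
        pooled = [sum(col) / len(block) for col in zip(*block)]
        # Raw logit only: ranking by logits matches ranking by softmax
        # probabilities, so no exp is needed to pick the top blocks.
        scores.append(dot(index_query, pooled))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query, keys, values, block_ids, block_size):
    """Main branch: exact softmax attention restricted to the selected blocks."""
    idx = [i for b in block_ids
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / z
            for d in range(dim)]


def msa_decode_step(group_queries, keys, values, block_size, top_k):
    """One decode step for one GQA group: select once per group, attend per head."""
    n = len(group_queries)
    # Assumption: the index query is the mean of the group's query heads.
    index_query = [sum(col) / n for col in zip(*group_queries)]
    block_ids = select_blocks(index_query, keys, block_size, top_k)
    outputs = [sparse_attention(q, keys, values, block_ids, block_size)
               for q in group_queries]
    return block_ids, outputs
```

Because selection happens once per GQA group, every query head in the group reads the same KV blocks, which is what keeps memory access block-granular. Setting `top_k` to the total number of blocks makes the sketch equivalent to dense attention over the shared KV head.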