MiniMax Sparse Attention

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at https://huggingface.co/MiniMaxAI/MiniMax-M3.
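The two-branch flow described in the abstract can be illustrated with a minimal NumPy sketch for a single decoding step. This is not the released kernel. The abstract does not specify how the Index Branch scores blocks, so the mean-pooled block keys and group-averaged query used here are assumptions made purely for illustration, as are all function and parameter names (`select_blocks`, `sparse_attention`, `block_size`, `top_k`). The sketch ranks raw scores without a softmax, which is sufficient for Top-k because softmax preserves ordering; whether this matches the paper's exp-free selection is likewise an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_blocks(q, k, block_size, top_k):
    """Hypothetical index branch: pick top_k key-value blocks per GQA group.

    q: (G, Hg, D) queries of one token, Hg query heads per group.
    k: (G, T, D)  one shared key head per group; T is assumed to be a
                  multiple of block_size and top_k <= T // block_size.
    Returns sorted block indices of shape (G, top_k).
    """
    G, T, D = k.shape
    n_blocks = T // block_size
    # Assumption: summarize each block by the mean of its keys.
    k_blocks = k.reshape(G, n_blocks, block_size, D).mean(axis=2)
    # Assumption: summarize each group by the mean of its query heads.
    q_group = q.mean(axis=1)
    scores = np.einsum("gd,gbd->gb", q_group, k_blocks)
    # Top-k on raw scores: no exponentials are needed for ranking.
    idx = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]
    return np.sort(idx, axis=-1)

def sparse_attention(q, k, v, block_size, top_k):
    """Main branch: exact softmax attention restricted to selected blocks."""
    G, Hg, D = q.shape
    blocks = select_blocks(q, k, block_size, top_k)
    out = np.empty_like(q)
    offsets = np.arange(block_size)
    for g in range(G):
        # Expand block indices into token indices for this group.
        tok = (blocks[g][:, None] * block_size + offsets).reshape(-1)
        logits = q[g] @ k[g, tok].T / np.sqrt(D)
        out[g] = softmax(logits) @ v[g, tok]
    return out, blocks
```

When `top_k` equals the total number of blocks, the sketch reduces to dense GQA attention, which gives a simple correctness check. In this sketch the main-branch cost per token scales with `top_k * block_size` rather than with the full context length `T`.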