MiniMax Sparse Attention

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
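
To make the two-branch structure concrete, the following is a minimal pure-Python sketch of one decode step for a single GQA group. It is an illustration under stated assumptions, not the released kernel. The abstract does not specify how the Index Branch scores blocks, so the mean-pooled block keys, the group-averaged index query, and all function names (`select_blocks`, `softmax_attention`, `msa_decode_step`) are hypothetical. The sketch only reflects what the abstract states: blocks are scored, a Top-k subset is chosen per group, and exact softmax attention then runs over the selected blocks only. The KV-outer loop order and tensor-core details of the actual GPU path are not modeled.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def select_blocks(group_queries, keys, block_size, top_k):
    """Index branch (assumed form): score every KV block, return Top-k block ids."""
    q_index = mean_vec(group_queries)  # assumption: one index query per GQA group
    scores = []
    for start in range(0, len(keys), block_size):
        # assumption: each block is summarized by its mean-pooled key
        block_key = mean_vec(keys[start:start + block_size])
        scores.append(dot(q_index, block_key))
    # Ranking raw scores needs no exp or softmax, because Top-k is
    # unchanged by any monotonic transform of the scores.
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def softmax_attention(query, keys, values):
    """Exact scaled dot-product attention for one query over the given tokens."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def msa_decode_step(group_queries, keys, values, block_size, top_k):
    """Main branch: exact attention restricted to the blocks the index branch chose."""
    block_ids = select_blocks(group_queries, keys, block_size, top_k)
    token_ids = [t for b in block_ids
                 for t in range(b * block_size,
                                min((b + 1) * block_size, len(keys)))]
    sel_keys = [keys[t] for t in token_ids]
    sel_values = [values[t] for t in token_ids]
    # All query heads in the group share the same KV head and the same selection.
    outputs = [softmax_attention(q, sel_keys, sel_values) for q in group_queries]
    return block_ids, outputs

# Toy input: 8 cached tokens, blocks of 2 tokens, 2 query heads in the group.
keys = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2 + [[-1.0, 0.0]] * 2 + [[0.0, -1.0]] * 2
values = [[float(t), 1.0] for t in range(8)]
queries = [[2.0, 1.0], [1.0, 2.0]]
block_ids, outputs = msa_decode_step(queries, keys, values, block_size=2, top_k=2)
print(block_ids, outputs)
```

In this sketch the main branch touches `top_k * block_size` tokens per step instead of the whole cache, which is where the per-token compute saving comes from. Setting `top_k` to the total number of blocks recovers dense attention exactly, which gives a simple correctness check for the selection logic.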