MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Abstract

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
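The abstract states only that CISPO clips importance sampling weights rather than token updates. The sketch below illustrates that contrast under stated assumptions and is not the authors' implementation. It compares the per-token gradient coefficient (with respect to the token's log-probability) of a PPO-style clipped surrogate against a surrogate whose clipped importance weight is treated as a constant. The function names, the clipping range `eps = 0.2`, and the toy ratios and advantages are all hypothetical choices made for illustration.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_style_coeff(ratio, advantage, eps=0.2):
    """Gradient coefficient of min(r*A, clip(r)*A) w.r.t. log pi.

    Because dr/d(log pi) = r, the coefficient is r*A while the unclipped
    term is active, and 0 once the clipped (constant) term is active.
    In the second case the token stops contributing to the update.
    """
    unclipped = ratio * advantage
    clipped = clip(ratio, 1 - eps, 1 + eps) * advantage
    return ratio * advantage if unclipped <= clipped else 0.0


def clipped_weight_coeff(ratio, advantage, eps=0.2):
    """Gradient coefficient of stop_grad(clip(r)) * A * log pi w.r.t. log pi.

    The clipped weight is held constant (assumed stop-gradient), so the
    coefficient is clip(r)*A: bounded in size, but never forced to zero.
    """
    return clip(ratio, 1 - eps, 1 + eps) * advantage


# Hypothetical (ratio, advantage) pairs: one on-policy token, three off-policy.
tokens = [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]

ppo_coeffs = [ppo_style_coeff(r, a) for r, a in tokens]
weight_coeffs = [clipped_weight_coeff(r, a) for r, a in tokens]

for (r, a), p, w in zip(tokens, ppo_coeffs, weight_coeffs):
    print(f"ratio={r:.1f} adv={a:+.1f}  ppo_style={p:+.2f}  clipped_weight={w:+.2f}")
```

In this toy setting, the second and third tokens receive a zero coefficient under the PPO-style surrogate, while the clipped-weight surrogate keeps a bounded, nonzero coefficient for every token. Whether this matches the exact CISPO objective, including its normalization and clipping bounds, should be checked against the full paper.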
