M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Abstract

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length because of their quadratic computational complexity and linear memory requirements. In this paper, we introduce M1, a novel hybrid linear RNN reasoning model built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with vLLM, a highly performant general-purpose inference engine, and observe more than a 3x speedup over a transformer of the same size. With this throughput speedup, M1 achieves higher accuracy than DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
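The self-consistency argument above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the throughput figures, and the assumption that every sampled chain has the same length are hypothetical and chosen only to show how a higher generation throughput turns into more votes under a fixed time budget.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common sorts by count; equal counts keep first-seen order.
    return Counter(answers).most_common(1)[0][0]


def samples_within_budget(tokens_per_second, budget_seconds, tokens_per_sample):
    """Number of complete chains that fit in the time budget.

    Assumes (hypothetically) a constant throughput and a fixed chain length.
    """
    return int(tokens_per_second * budget_seconds // tokens_per_sample)


# Hypothetical numbers: a 3x throughput gain yields 3x as many votes.
baseline_votes = samples_within_budget(1000.0, 60, 4000)
faster_votes = samples_within_budget(3000.0, 60, 4000)
final_answer = majority_vote(["42", "41", "42"])
```

In this toy setting the faster model casts three times as many votes in the same wall-clock time, which is the mechanism the abstract credits for the accuracy gain; real throughput depends on batch size, sequence length, and hardware.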
