M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

Abstract

Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have improved performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in how far they can extend context length, because their computation grows quadratically and their memory grows linearly with sequence length. In this paper, we introduce M1, a novel hybrid linear RNN reasoning model built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed against vLLM, a highly performant general-purpose inference engine, and observe more than a 3x speedup over a same-size transformer. With this throughput speedup, M1 achieves higher accuracy than DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
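To make the fixed-time-budget argument concrete, the following is a minimal sketch of self-consistency voting, not the paper's implementation. The helper names (`samples_within_budget`, `majority_vote`) and all timing numbers are hypothetical and chosen only for illustration; the sketch assumes each sampled chain of thought has already been reduced to a final answer string.

```python
from collections import Counter


def samples_within_budget(time_budget_s, seconds_per_sample):
    """Number of complete samples that fit in a fixed generation time budget."""
    if seconds_per_sample <= 0:
        raise ValueError("seconds_per_sample must be positive")
    return int(time_budget_s // seconds_per_sample)


def majority_vote(answers):
    """Self-consistency: return the most frequent final answer.

    Ties go to the answer that was seen first. Returns None for no answers.
    """
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hypothetical timings: a model that generates 3x faster fits 3x as many
# samples into the same budget, so the vote has more evidence to work with.
budget_s = 60
slow_samples = samples_within_budget(budget_s, seconds_per_sample=12)
fast_samples = samples_within_budget(budget_s, seconds_per_sample=4)

# Made-up final answers extracted from sampled reasoning chains.
sampled_answers = ["42", "17", "42", "42", "9"]
voted_answer = majority_vote(sampled_answers)
```

Under these assumed timings, the faster model casts three times as many votes within the same budget, which is the mechanism the abstract credits for the accuracy gain at a fixed generation time.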
