

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

May 22, 2025
Authors: Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
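As a concrete illustration of the verification-based RL setup and the math-first-then-code curriculum described in the abstract, the sketch below shows how binary rewards might be computed from verifiable answers and test cases, and how a staged schedule with progressively larger response-length budgets could be laid out. This is a minimal sketch, not the authors' implementation: the function names, stage boundaries, and token limits are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of two ingredients the abstract describes:
# verification-based rewards for math and code, and a staged math-first-then-code
# schedule with progressively longer response budgets. Names are assumptions.

import re
import subprocess
import sys
import tempfile


def math_reward(response: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the final \\boxed{...} answer matches the reference."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0


def code_reward(program: str, test_cases: list[tuple[str, str]],
                timeout_s: float = 5.0) -> float:
    """Binary reward: 1.0 only if the program passes every (stdin, expected-stdout) test."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    for stdin_text, expected in test_cases:
        try:
            run = subprocess.run([sys.executable, path], input=stdin_text,
                                 capture_output=True, text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return 0.0
        if run.returncode != 0 or run.stdout.strip() != expected.strip():
            return 0.0
    return 1.0


# Hypothetical stage schedule: math-only RL first, with a curriculum of growing
# response-length caps, then code-only RL. The specific token limits are made up.
STAGES = [
    ("math", 8_000),
    ("math", 16_000),
    ("math", 24_000),
    ("code", 24_000),
    ("code", 32_000),
]

if __name__ == "__main__":
    # Tiny usage example of the two verifiers.
    print(math_reward(r"... so the answer is \boxed{42}.", "42"))             # 1.0
    print(code_reward("print(int(input()) * 2)", [("3", "6"), ("5", "10")]))  # 1.0
```

The binary pass/fail rewards reflect the abstract's emphasis on prompts with high-quality, verifiable answers and test cases; an on-policy RL algorithm would sample responses from the current policy at each stage, score them with these verifiers, and update the model.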

