
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

May 22, 2025
Authors: Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
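Below is a minimal sketch of the staged training recipe the abstract describes: math-only RL first, then code-only RL, with a curriculum that progressively increases the response-length cap and relies on verification-based rewards. The policy, verifier, RL update, and all length values are illustrative stubs assumed for the example, not the authors' implementation.

```python
# Hypothetical sketch of the staged RL recipe: math-only RL, then code-only RL,
# with a response-length curriculum and verification-based rewards.
# All names, stubs, and length values below are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str                        # e.g. "math-only" or "code-only"
    prompts: List[str]               # curated prompts with verifiable answers / test cases
    max_response_lengths: List[int]  # curriculum: progressively longer generations


def verify(prompt: str, response: str) -> float:
    """Verification-based reward: 1.0 if the answer or test cases check out, else 0.0.
    (Stub: a real pipeline would run an answer checker or execute unit tests.)"""
    return 0.0


def rl_update(policy, prompts: List[str], max_len: int,
              reward_fn: Callable[[str, str], float]) -> None:
    """One on-policy update: sample responses up to max_len tokens,
    score them with the verifier, and update the policy. (Stub.)"""
    pass


def train(policy, stages: List[Stage], steps_per_length: int = 100) -> None:
    for stage in stages:                              # stage 1: math-only, stage 2: code-only
        for max_len in stage.max_response_lengths:    # length curriculum within each stage
            for _ in range(steps_per_length):
                rl_update(policy, stage.prompts, max_len, verify)


if __name__ == "__main__":
    policy = object()  # placeholder for the distilled 7B/14B starting model
    train(policy, [
        Stage("math-only", prompts=["..."], max_response_lengths=[8192, 16384, 24576]),
        Stage("code-only", prompts=["..."], max_response_lengths=[24576, 32768]),
    ])
```

The staging reflects the abstract's finding that math-only RL already improves both math and code benchmarks, while the subsequent code-only phase further improves code performance with little or no degradation on math.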
