

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

June 16, 2025
Authors: Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
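The abstract does not spell out how "temperature-adjusted entropy" is computed, but a natural reading is the mean per-token entropy of the policy's temperature-scaled sampling distribution over rollouts. Below is a minimal PyTorch sketch of how one might measure that quantity and pick a sampling temperature that keeps it near the 0.3 target; the function names and the candidate temperature grid are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def temperature_adjusted_entropy(logits: torch.Tensor, temperature: float) -> float:
    """Mean per-token entropy (in nats) of the temperature-scaled softmax.

    logits: (num_tokens, vocab_size) pre-softmax scores collected from the
    policy model over sampled rollout tokens.
    """
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-token entropy, shape (num_tokens,)
    return entropy.mean().item()

def pick_temperature(logits: torch.Tensor, target: float = 0.3,
                     candidates=(0.6, 0.8, 1.0, 1.2, 1.4)) -> float:
    """Choose the candidate temperature whose induced mean entropy is
    closest to the target (~0.3 in the paper's recipe)."""
    return min(candidates,
               key=lambda t: abs(temperature_adjusted_entropy(logits, t) - target))
```

In this reading, a higher temperature flattens the distribution (more exploration, higher entropy) while a lower one sharpens it (more exploitation); targeting a fixed entropy lets each SFT initialization get a temperature matched to how peaked its output distribution already is.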