
AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

June 16, 2025
作者: Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
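The temperature recommendation above is stated in terms of keeping the temperature-adjusted entropy of the policy near 0.3. As a rough illustration of that idea (a minimal sketch, not the authors' implementation), the snippet below estimates the mean per-token entropy of softmax(logits / T) for a Hugging Face causal LM and scans a few candidate temperatures for the value whose entropy is closest to the 0.3 target. The prompt list, candidate grid, and the helper name `mean_token_entropy` are illustrative assumptions; only the released checkpoint ID and the 0.3 target come from the abstract.

```python
# Sketch: pick an RL sampling temperature whose average token entropy is near 0.3.
# Assumes a Hugging Face causal LM; everything except the model ID and the 0.3
# target is an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/AceReason-Nemotron-1.1-7B"  # released checkpoint from the abstract


def mean_token_entropy(model, tokenizer, prompts, temperature):
    """Average entropy (in nats) of softmax(logits / temperature) over next-token
    distributions along each prompt. A simple proxy for the paper's
    'temperature-adjusted entropy'; the exact estimator is not specified here."""
    total, count = 0.0, 0
    with torch.no_grad():
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").to(model.device)
            logits = model(**ids).logits[0].float()              # (seq_len, vocab)
            probs = torch.softmax(logits / temperature, dim=-1)  # temperature-adjusted
            ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
            total += ent.sum().item()
            count += ent.numel()
    return total / count


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    prompts = ["Prove that the sum of two even integers is even."]  # placeholder RL prompts
    candidates = [0.6, 0.8, 1.0, 1.2]  # hypothetical temperature grid
    best = min(
        candidates,
        key=lambda t: abs(mean_token_entropy(model, tok, prompts, t) - 0.3),
    )
    print(f"temperature with entropy closest to the 0.3 target: {best}")
```

In practice one would estimate the entropy over the model's own RL rollouts rather than over fixed prompts, but the scan-and-compare structure stays the same.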