AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and increasing the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts producing the larger gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted, particularly when the sampling temperature is carefully chosen to maintain the temperature-adjusted entropy around 0.3, a setting that strikes a good balance between exploration and exploitation. Notably, the performance gap between initial SFT models narrows significantly throughout the RL process. Leveraging a strong SFT foundation and insights into the synergistic interplay between SFT and RL, our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models on challenging math and code benchmarks, thereby demonstrating the effectiveness of our post-training recipe. We release the model and data at: https://huggingface.co/nvidia/AceReason-Nemotron-1.1-7B
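
The abstract does not spell out how the temperature-adjusted entropy is computed. As a minimal sketch, assume it is the mean token-level Shannon entropy (in nats) of the policy distribution after dividing the logits by the sampling temperature. Under that assumption, a temperature that brings the entropy close to the 0.3 target can be found by bisection, because the entropy of a temperature-scaled softmax never decreases as the temperature rises. The function names, the search bounds, and the toy logits below are illustrative and are not taken from the paper or its released code.

```python
import math


def softmax_entropy(logits, temperature):
    """Shannon entropy (nats) of softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(batch_logits, temperature):
    """Average the per-token entropy over a batch of logit vectors."""
    entropies = [softmax_entropy(row, temperature) for row in batch_logits]
    return sum(entropies) / len(entropies)


def find_temperature(batch_logits, target=0.3, lo=0.05, hi=5.0, iters=60):
    """Bisection search for a temperature whose mean entropy is near `target`.

    Assumes the target lies between the entropies at `lo` and `hi`;
    otherwise the nearest bound is returned.
    """
    if mean_entropy(batch_logits, lo) > target:
        return lo
    if mean_entropy(batch_logits, hi) < target:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_entropy(batch_logits, mid) < target:
            lo = mid  # distribution too peaked: raise the temperature
        else:
            hi = mid  # distribution too flat: lower the temperature
    return 0.5 * (lo + hi)


# Toy logits standing in for next-token logits collected from rollouts.
toy_logits = [[2.0, 1.0, 0.0, -1.0], [3.0, 0.0, 0.0, 0.0]]
temperature = find_temperature(toy_logits, target=0.3)
print(f"temperature={temperature:.3f}, "
      f"entropy={mean_entropy(toy_logits, temperature):.3f}")
```

In practice the logits would come from the SFT checkpoint's own rollouts, so a sharper (stronger) SFT model would generally need a higher temperature than a flatter one to reach the same entropy.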
