AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Abstract

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models such as DeepSeek-R1, including data curation strategies and RL training recipes, are often omitted. Moreover, recent research indicates that distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly improves the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models) but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline that collects challenging prompts with high-quality, verifiable answers and test cases, enabling verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
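The abstract names three ingredients: verification-based rewards for math and code, a math-first then code-only stage order, and a curriculum of increasing response lengths. The Python sketch below illustrates how these pieces might fit together. It is not the authors' implementation: the function names, the binary (0/1) reward, the string normalization, and every token budget are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


def _normalize(answer: str) -> str:
    # Hypothetical normalization: drop whitespace and case differences.
    return answer.strip().replace(" ", "").lower()


def math_reward(model_answer: str, reference_answer: str) -> float:
    """Assumed binary reward: 1.0 if the final answer matches the reference."""
    return 1.0 if _normalize(model_answer) == _normalize(reference_answer) else 0.0


def code_reward(solution: Callable[[int], int],
                test_cases: Sequence[Tuple[int, int]]) -> float:
    """Assumed binary reward: 1.0 only if every test case passes."""
    for argument, expected in test_cases:
        try:
            if solution(argument) != expected:
                return 0.0
        except Exception:
            # A crashing candidate counts as a failed verification.
            return 0.0
    return 1.0


@dataclass(frozen=True)
class Stage:
    domain: str               # "math" or "code"
    max_response_tokens: int  # response-length budget for this stage


def build_schedule(math_lengths: Sequence[int],
                   code_lengths: Sequence[int]) -> List[Stage]:
    """Math-only stages first, then code-only stages, each with increasing length."""
    math_stages = [Stage("math", n) for n in sorted(math_lengths)]
    code_stages = [Stage("code", n) for n in sorted(code_lengths)]
    return math_stages + code_stages


def double(x: int) -> int:
    return 2 * x


if __name__ == "__main__":
    print(math_reward(" 42 ", "42"))
    print(code_reward(double, [(1, 2), (3, 6)]))
    # Token budgets here are placeholders, not values from the paper.
    for stage in build_schedule([16000, 8000], [24000]):
        print(stage)
```

In an actual RL loop, each stage would sample prompts from its own domain only and cap generation at the stage's length budget, and the rewards above would be computed on sampled responses before the policy update.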
