ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Abstract

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
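The abstract relies on pass@k comparisons but does not say how pass@k is computed. As a minimal sketch, assuming the widely used unbiased estimator 1 - C(n-c, k)/C(n, k) (n samples drawn per problem, c of them correct), the metric could be computed as follows. The function name `pass_at_k` and its parameters are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumed estimator (not confirmed by the abstract):
        pass@k = 1 - C(n - c, k) / C(n, k)
    i.e. one minus the probability that a random size-k subset of the
    n samples contains no correct solution.
    """
    if n <= 0 or not 0 <= c <= n:
        raise ValueError("require n > 0 and 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pass_at_k(n=4, c=2, k=2))
    print(pass_at_k(n=10, c=0, k=5))
```

Under this estimator, a problem with c = 0 scores 0 for every k, which corresponds to the case the abstract describes where a base model fails entirely regardless of the number of attempts.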
