ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
May 30, 2025
Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
cs.AI
Abstract
Recent advances in reasoning-centric language models have highlighted
reinforcement learning (RL) as a promising method for aligning models with
verifiable rewards. However, it remains contentious whether RL truly expands a
model's reasoning capabilities or merely amplifies high-reward outputs already
latent in the base model's distribution, and whether continually scaling up RL
compute reliably leads to improved reasoning performance. In this work, we
challenge prevailing assumptions by demonstrating that prolonged RL (ProRL)
training can uncover novel reasoning strategies that are inaccessible to base
models, even under extensive sampling. We introduce ProRL, a novel training
methodology that incorporates KL divergence control, reference policy
resetting, and a diverse suite of tasks. Our empirical analysis reveals that
RL-trained models consistently outperform base models across a wide range of
pass@k evaluations, including scenarios where base models fail entirely
regardless of the number of attempts. We further show that reasoning-boundary
improvements correlate strongly with the base model's task competence and with
training duration, suggesting that RL can explore and populate new regions of
the solution space over time. These findings offer new insights into the conditions
under which RL meaningfully expands reasoning boundaries in language models and
establish a foundation for future work on long-horizon RL for reasoning. We
release model weights to support further research:
https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
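The abstract names "KL divergence control" and "reference policy resetting" without defining them; for orientation, below is a minimal sketch of the standard KL-regularized RL objective that such phrasing typically refers to, where pi_ref is the (periodically reset) reference policy and beta the penalty coefficient. This is an assumption about the general form, not the paper's exact loss.

```latex
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
          - \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```

Periodically resetting \pi_{\mathrm{ref}} to a recent snapshot of \pi_\theta moves the penalty's anchor, which is one way prolonged training can keep exploring without the KL term freezing the policy near its starting point.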
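The pass@k results mentioned above rely on a standard metric; as a reference, here is a minimal sketch of the unbiased pass@k estimator introduced by Chen et al. (2021), which is the usual way such numbers are computed. The function name and the example numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the
    probability that at least one of k completions, drawn
    without replacement from n samples of which c are
    correct, solves the task."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw
        # of k samples must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 3 correct completions out of 64 samples.
print(pass_at_k(n=64, c=3, k=8))  # ~0.335
```

A base model that "fails entirely regardless of the number of attempts" corresponds to c = 0, for which this estimator is 0 at every k <= n.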