ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
May 30, 2025
作者: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
cs.AI
Abstract
Recent advances in reasoning-centric language models have highlighted
reinforcement learning (RL) as a promising method for aligning models with
verifiable rewards. However, it remains contentious whether RL truly expands a
model's reasoning capabilities or merely amplifies high-reward outputs already
latent in the base model's distribution, and whether continually scaling up RL
compute reliably leads to improved reasoning performance. In this work, we
challenge prevailing assumptions by demonstrating that prolonged RL (ProRL)
training can uncover novel reasoning strategies that are inaccessible to base
models, even under extensive sampling. We introduce ProRL, a novel training
methodology that incorporates KL divergence control, reference policy
resetting, and a diverse suite of tasks. Our empirical analysis reveals that
RL-trained models consistently outperform base models across a wide range of
pass@k evaluations, including scenarios where base models fail entirely
regardless of the number of attempts. We further show that reasoning boundary
improvements correlate strongly with the base model's task competence and
training duration, suggesting that RL can explore and populate new regions of
solution space over time. These findings offer new insights into the conditions
under which RL meaningfully expands reasoning boundaries in language models and
establish a foundation for future work on long-horizon RL for reasoning. We
release model weights to support further research:
https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
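The two stabilization mechanisms named in the abstract — KL divergence control and reference policy resetting — can be illustrated with a minimal sketch. The paper's actual objective, hyperparameters, and reset schedule are not given in the abstract, so the coefficient `BETA`, the interval `RESET_INTERVAL`, and both function names below are hypothetical; this shows only the general shape of a KL-penalized reward with periodic hard resets of the reference policy.

```python
BETA = 0.01           # KL penalty coefficient (hypothetical value)
RESET_INTERVAL = 500  # steps between reference resets (hypothetical value)

def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta=BETA):
    """Shape the verifiable task reward with a KL penalty.

    A standard per-sample KL estimate is log pi_theta(y|x) - log pi_ref(y|x);
    subtracting it discourages the policy from drifting too far from the
    reference during prolonged training.
    """
    kl_estimate = logp_policy - logp_ref
    return task_reward - beta * kl_estimate

def step_reference(step, policy_params, ref_params, interval=RESET_INTERVAL):
    """Periodically hard-reset the reference to the current policy.

    Resetting relaxes the KL anchor so optimization can keep making
    progress instead of being pulled back toward a stale reference.
    """
    if step > 0 and step % interval == 0:
        return dict(policy_params)  # snapshot the current policy
    return ref_params               # otherwise keep the old reference
```

With this shaping, a sample whose log-probability matches the reference keeps its full task reward, while one that drifts pays a penalty proportional to the drift; the reset keeps that penalty anchored to a recent, rather than initial, policy.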