ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
May 30, 2025
Authors: Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
cs.AI
Abstract
Recent advances in reasoning-centric language models have highlighted
reinforcement learning (RL) as a promising method for aligning models with
verifiable rewards. However, it remains contentious whether RL truly expands a
model's reasoning capabilities or merely amplifies high-reward outputs already
latent in the base model's distribution, and whether continually scaling up RL
compute reliably leads to improved reasoning performance. In this work, we
challenge prevailing assumptions by demonstrating that prolonged RL (ProRL)
training can uncover novel reasoning strategies that are inaccessible to base
models, even under extensive sampling. We introduce ProRL, a novel training
methodology that incorporates KL divergence control, reference policy
resetting, and a diverse suite of tasks. Our empirical analysis reveals that
RL-trained models consistently outperform base models across a wide range of
pass@k evaluations, including scenarios where base models fail entirely
regardless of the number of attempts. We further show that reasoning-boundary
improvements correlate strongly with the base model's task competence and with
training duration, suggesting that RL can explore and populate new regions of
the solution space over time. These findings offer new insights into the conditions
under which RL meaningfully expands reasoning boundaries in language models and
establish a foundation for future work on long-horizon RL for reasoning. We
release model weights to support further research:
https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
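The abstract names "KL divergence control" and "reference policy resetting" without defining them; for orientation, below is a minimal sketch of the standard KL-regularized RL objective that such phrasing typically refers to, where pi_ref is the (periodically reset) reference policy and beta the penalty coefficient. This is an assumption about the general form, not the paper's exact loss.

```latex
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
          - \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```

Periodically resetting \pi_{\mathrm{ref}} to a recent snapshot of \pi_\theta moves the penalty's anchor, which is one way prolonged training can keep exploring without the KL term freezing the policy near its starting point.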
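The pass@k results mentioned above rely on a standard metric; as a reference, here is a minimal sketch of the unbiased pass@k estimator introduced by Chen et al. (2021), which is the usual way such numbers are computed. The function name and the example numbers are illustrative, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the
    probability that at least one of k completions, drawn
    without replacement from n samples of which c are
    correct, solves the task."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw
        # of k samples must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 3 correct completions out of 64 samples.
print(pass_at_k(n=64, c=3, k=8))  # ~0.335
```

A base model that "fails entirely regardless of the number of attempts" corresponds to c = 0, for which this estimator is 0 at every k <= n.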