ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Abstract

Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably improves reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that improvements in the reasoning boundary correlate strongly with the base model's task competence and with training duration, suggesting that RL can explore and populate new regions of the solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
