

The Art of Scaling Reinforcement Learning Compute for LLMs

October 15, 2025
作者: Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
cs.AI

Abstract

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
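The abstract's central methodological tool is the sigmoidal compute-performance curve: fit a saturating sigmoid to a recipe's small-scale RL runs, then extrapolate the fitted curve to much larger compute budgets. The sketch below illustrates that workflow; the parameterization, the names (`sigmoid_perf`, `c_mid`, `r0`), and all data values are illustrative assumptions, not the paper's published form.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, a, b, c_mid, r0):
    """Saturating sigmoid: performance rises from a floor r0 toward an
    asymptote a; b sets the steepness (compute efficiency) and c_mid is
    the compute at which half the gain from r0 to a is realized."""
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)

# Hypothetical (compute, validation-performance) points from small-scale runs.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])  # GPU-hours
perf = np.array([0.22, 0.30, 0.41, 0.52, 0.58, 0.61])

# Fit the curve on the cheap runs...
params, _ = curve_fit(
    sigmoid_perf, compute, perf,
    p0=[0.7, 1.0, 1e3, 0.2],                 # rough initial guesses
    bounds=([0, 0, 0, 0], [1, 10, 1e6, 1]),  # keep parameters in sane ranges
)
a, b, c_mid, r0 = params

# ...then extrapolate to a budget far beyond the fitted range.
print(f"asymptote A = {a:.3f}, efficiency B = {b:.3f}, midpoint C_mid = {c_mid:.0f}")
print(f"predicted performance at 1e5 GPU-hours: {sigmoid_perf(1e5, *params):.3f}")
```

Under this kind of parameterization, the fitted asymptote distinguishes recipes by their achievable ceiling, while the remaining parameters capture compute efficiency, mirroring the abstract's separation of asymptotic performance from efficiency-modulating design choices.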