The Art of Scaling Reinforcement Learning Compute for LLMs
October 15, 2025
Authors: Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
cs.AI
Abstract
Reinforcement learning (RL) has become central to training large language
models (LLMs), yet the field lacks predictive scaling methodologies comparable
to those established for pre-training. Despite rapidly rising compute budgets,
there is no principled understanding of how to evaluate algorithmic
improvements for scaling RL compute. We present the first large-scale
systematic study, amounting to more than 400,000 GPU-hours, that defines a
principled framework for analyzing and predicting RL scaling in LLMs. We fit
sigmoidal compute-performance curves for RL training and ablate a wide range of
common design choices to analyze their effects on asymptotic performance and
compute efficiency. We observe: (1) Not all recipes yield similar asymptotic
performance, (2) Details such as loss aggregation, normalization, curriculum,
and off-policy algorithm primarily modulate compute efficiency without
materially shifting the asymptote, and (3) Stable, scalable recipes follow
predictable scaling trajectories, enabling extrapolation from smaller-scale
runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and
demonstrate its effectiveness by successfully scaling and predicting validation
performance on a single RL run scaled up to 100,000 GPU-hours. Our work
provides both a scientific framework for analyzing scaling in RL and a
practical recipe that brings RL training closer to the predictability long
achieved in pre-training.
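
To illustrate the kind of extrapolation the abstract describes, the sketch below fits a saturating sigmoid-like compute-performance curve to a handful of (GPU-hours, validation pass rate) checkpoints and then predicts performance at a larger budget. This is a minimal, hypothetical example: the function name `sigmoid_perf`, the exact parameterization (asymptote, midpoint, steepness, and baseline), and the data points are assumptions for illustration and may not match the paper's fitted form.

```python
# Hypothetical sketch: fit a saturating compute-performance curve to RL training
# checkpoints and extrapolate to a larger compute budget. The functional form and
# the data below are illustrative assumptions, not the paper's exact recipe.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_perf(compute, asymptote, c_mid, slope, r0):
    """Performance rises from baseline r0 toward `asymptote` as compute grows,
    with midpoint c_mid (in GPU-hours) and steepness `slope`."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** slope)

# Illustrative (made-up) observations: GPU-hours vs. validation pass rate.
compute = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
perf = np.array([0.32, 0.41, 0.52, 0.58, 0.61])

# Fit asymptote A, midpoint C_mid, steepness B, and baseline R0 from
# small- and medium-scale runs.
(A, C_mid, B, R0), _ = curve_fit(
    sigmoid_perf, compute, perf,
    p0=[0.7, 1e4, 1.0, 0.3],
    bounds=([0.0, 1e2, 0.1, 0.0], [1.0, 1e7, 10.0, 1.0]),
)

# Extrapolate to a larger budget to check whether the recipe sits on a
# predictable trajectory before committing the full compute.
print(f"fitted asymptote A ~= {A:.3f}, C_mid ~= {C_mid:.0f} GPU-hours, B ~= {B:.2f}")
print(f"predicted performance at 1M GPU-hours ~= {sigmoid_perf(1e6, A, C_mid, B, R0):.3f}")
```

In this framing, the fitted asymptote captures the ceiling a recipe can reach, while the midpoint and steepness capture its compute efficiency, which matches the abstract's distinction between design choices that shift the asymptote and those that only modulate efficiency.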