AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
May 30, 2025
Authors: Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
cs.AI
Abstract
Reinforcement learning (RL) has become a trending paradigm for training large
language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs
requires massive parallelization, creating an urgent need for efficient
training systems. Most existing large-scale RL systems for LLMs are synchronous,
alternating generation and training in a batch setting where the rollouts in
each training batch are generated by the same (or latest) model. This
stabilizes RL training but suffers from severe system-level inefficiency:
generation must wait until the longest output in the batch is completed before
the model can be updated, resulting in GPU underutilization. We present AReaL, a
fully asynchronous RL system that completely decouples generation from
training. Rollout workers in AReaL continuously generate new outputs without
waiting, while training workers update the model whenever a batch of data is
collected. AReaL also incorporates a collection of system-level optimizations,
leading to substantially higher GPU utilization. To stabilize RL training,
AReaL balances the workload of rollout and training workers to control data
staleness, and adopts a staleness-enhanced PPO variant to better handle
outdated training samples. Extensive experiments on math and code reasoning
benchmarks show that AReaL achieves up to 2.57× training speedup compared to
the best synchronous systems with the same number of GPUs, with matched or even
improved final performance. The code of AReaL is available at
https://github.com/inclusionAI/AReaL/.
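
To make the decoupling concrete, here is a minimal producer/consumer sketch of the asynchronous pattern the abstract describes: rollout workers generate continuously without waiting, while a trainer consumes batches and bounds data staleness by policy version. All names here (`rollout_worker`, `trainer_worker`, `BATCH_SIZE`, `MAX_STALENESS`) are illustrative assumptions, not AReaL's actual API.

```python
# Minimal sketch of asynchronous rollout/training decoupling (illustrative;
# not AReaL's actual implementation).
import queue
import threading

BATCH_SIZE = 8        # hypothetical training batch size
MAX_STALENESS = 4     # hypothetical bound on policy-version lag

rollout_queue: "queue.Queue[tuple[int, object]]" = queue.Queue()
model_version = 0
version_lock = threading.Lock()

def rollout_worker(generate):
    """Continuously generates new outputs without waiting for training."""
    while True:
        with version_lock:
            version = model_version   # weights version used for this rollout
        rollout_queue.put((version, generate()))

def trainer_worker(update):
    """Updates the model whenever a fresh-enough batch has been collected."""
    global model_version
    while True:
        batch = []
        while len(batch) < BATCH_SIZE:
            version, sample = rollout_queue.get()
            # Staleness control: discard rollouts produced by weights that
            # lag the current policy by more than MAX_STALENESS versions.
            if model_version - version <= MAX_STALENESS:
                batch.append(sample)
        update(batch)                 # e.g., one PPO-style gradient step
        with version_lock:
            model_version += 1
```

Bounding the version gap is what lets generation and training overlap without the trainer ever consuming arbitrarily outdated data.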
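
The abstract does not spell out the "staleness-enhanced PPO variant." One common way to make PPO tolerate stale samples is to importance-weight each sample from the (possibly outdated) behavior policy that generated it, while clipping the update against a recent proximal policy. The sketch below illustrates that general idea in PyTorch; it should be read as a generic example under these assumptions, not as AReaL's exact objective.

```python
import torch

def staleness_aware_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """Generic staleness-aware PPO-style loss (illustrative sketch).

    logp_new:   log-probs under the current policy being optimized
    logp_prox:  log-probs under a recent "proximal" policy (clipping anchor)
    logp_behav: log-probs under the stale policy that sampled the data
    adv:        advantage estimates
    """
    # Clip the update relative to the proximal policy, as in standard PPO.
    ratio = torch.exp(logp_new - logp_prox)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # Correct for staleness: importance-weight each sample from the behavior
    # policy to the proximal policy, capping the weight for stability.
    iw = torch.exp(logp_prox - logp_behav).clamp(max=10.0)
    return -(iw * surrogate).mean()
```

For token-level LLM training, these log-probs would be per-token tensors and the loss would be averaged over valid (non-padding) tokens.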