
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

May 30, 2025
Authors: Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
cs.AI

Abstract

Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and creates an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before the model can be updated, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to a 2.57× training speedup compared to the best synchronous systems with the same number of GPUs, while matching or even improving final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
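
To make the asynchronous design concrete, below is a minimal runnable Python sketch of the kind of producer/consumer decoupling the abstract describes: rollout workers keep generating trajectories tagged with the policy version that produced them, while the trainer consumes a batch as soon as enough sufficiently fresh samples have arrived and discards those exceeding a staleness bound. All names (rollout_worker, MAX_STALENESS, etc.) are illustrative assumptions, not the actual AReaL APIs, and the PPO-style update is left as a placeholder rather than the paper's exact staleness-enhanced objective.

    # Hypothetical sketch of AReaL-style asynchronous rollout/training decoupling.
    # Names and structure are illustrative, not the actual AReaL implementation.
    import queue
    import random
    import threading
    import time

    MAX_STALENESS = 4   # accept samples whose policy version lags the trainer by at most this
    BATCH_SIZE = 8
    TOTAL_UPDATES = 20

    sample_queue = queue.Queue(maxsize=64)
    model_version = 0   # shared, monotonically increasing policy version
    version_lock = threading.Lock()
    stop_event = threading.Event()

    def rollout_worker():
        """Continuously generate trajectories without waiting for the trainer."""
        while not stop_event.is_set():
            with version_lock:
                behavior_version = model_version
            trajectory = {
                "tokens": [random.random() for _ in range(16)],  # stands in for generated tokens
                "behavior_version": behavior_version,
            }
            time.sleep(0.01)  # stands in for LLM generation latency
            try:
                sample_queue.put(trajectory, timeout=0.1)
            except queue.Full:
                pass  # back-pressure: the trainer is behind, skip rather than block forever

    def trainer():
        """Update the model whenever a fresh-enough batch has been collected."""
        global model_version
        for _ in range(TOTAL_UPDATES):
            batch = []
            while len(batch) < BATCH_SIZE:
                sample = sample_queue.get()
                if model_version - sample["behavior_version"] <= MAX_STALENESS:
                    batch.append(sample)  # slightly stale data is still usable
                # otherwise: discard trajectories too far behind the current policy
            # ... compute a PPO-style clipped objective on `batch` and apply an optimizer step ...
            with version_lock:
                model_version += 1
        stop_event.set()

    if __name__ == "__main__":
        workers = [threading.Thread(target=rollout_worker, daemon=True) for _ in range(4)]
        for w in workers:
            w.start()
        trainer()

The design choice mirrored here is that generation never blocks on training; bounding how stale accepted samples may be (together with the staleness-enhanced PPO objective described in the paper) is what keeps training stable despite the asynchrony.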