
Efficient RL Training for LLMs with Experience Replay

April 9, 2026
作者: Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
cs.AI

Abstract

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing their optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
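The abstract does not specify the paper's buffer implementation, but the core idea of bounding staleness-induced variance while reusing expensive rollouts can be sketched as a small replay buffer that tags each rollout with the policy version that generated it and evicts entries older than a staleness limit. All class and parameter names below (`ReplayBuffer`, `max_staleness`, etc.) are hypothetical illustrations, not the authors' code:

```python
import random
from collections import deque


class ReplayBuffer:
    """A minimal sketch of a staleness-bounded replay buffer for LLM rollouts.

    Each stored entry is a (policy_version, rollout) pair. Bounding how many
    policy updates a rollout may lag behind the current policy is one way to
    cap the staleness-induced variance the abstract refers to.
    """

    def __init__(self, capacity: int, max_staleness: int):
        # deque(maxlen=...) also evicts the oldest entries once capacity is hit.
        self.buffer = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, policy_version: int, rollout: str) -> None:
        self.buffer.append((policy_version, rollout))

    def evict_stale(self, current_version: int) -> None:
        # Drop rollouts generated too many policy updates ago.
        self.buffer = deque(
            ((v, r) for v, r in self.buffer
             if current_version - v <= self.max_staleness),
            maxlen=self.buffer.maxlen,
        )

    def sample(self, k: int):
        # Uniformly sample up to k stored rollouts for reuse in a gradient step,
        # amortizing the generation cost over multiple updates.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))
```

In practice, a training loop under these assumptions would mix a few fresh on-policy generations with samples drawn from the buffer at each step, trading some off-policy bias for a large reduction in generation compute.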
April 15, 2026