
Efficient RL Training for LLMs with Experience Replay

April 9, 2026
Authors: Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
cs.AI

Abstract

While Experience Replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading, and in some cases even improving, final model performance, while preserving policy entropy.
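The mechanism the abstract describes, storing rollouts and reusing them while bounding how stale they may become relative to the current policy, can be sketched as a simple bounded buffer. This is a minimal illustrative sketch, not the authors' implementation; all names (`ReplayBuffer`, `max_staleness`, `policy_version`) are assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    """One stored trajectory with the reward it earned and the
    version of the policy that generated it (illustrative fields)."""
    prompt: str
    response: str
    reward: float
    policy_version: int


class ReplayBuffer:
    """Toy replay buffer with a staleness cap, sketching the trade-off
    in the abstract: reusing old rollouts saves generation compute,
    but rollouts too far behind the current policy add variance."""

    def __init__(self, capacity: int, max_staleness: int):
        self.capacity = capacity
        self.max_staleness = max_staleness
        self.buffer: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        # Evict the oldest entry when the buffer is full (FIFO).
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(rollout)

    def evict_stale(self, current_version: int) -> None:
        # Drop rollouts generated more than `max_staleness`
        # policy updates ago.
        self.buffer = [
            r for r in self.buffer
            if current_version - r.policy_version <= self.max_staleness
        ]

    def sample(self, batch_size: int) -> list[Rollout]:
        # Uniform sampling over retained rollouts; the paper studies
        # how such reuse interacts with on-policy training.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

A real system would combine this with off-policy corrections (e.g. importance weighting) when training on stale samples; the sketch only shows the storage and eviction side of the trade-off.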