
Efficient RL Training for LLMs with Experience Replay

April 9, 2026
作者: Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
cs.AI

Abstract

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing their optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
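The abstract does not specify the paper's buffer implementation, but the core idea of bounding staleness-induced variance while reusing expensive rollouts can be sketched as a small replay buffer that tags each rollout with the policy version that generated it and evicts entries older than a staleness limit. All class and parameter names below (`ReplayBuffer`, `max_staleness`, etc.) are hypothetical illustrations, not the authors' code:

```python
import random
from collections import deque


class ReplayBuffer:
    """A minimal sketch of a staleness-bounded replay buffer for LLM rollouts.

    Each stored entry is a (policy_version, rollout) pair. Bounding how many
    policy updates a rollout may lag behind the current policy is one way to
    cap the staleness-induced variance the abstract refers to.
    """

    def __init__(self, capacity: int, max_staleness: int):
        # deque(maxlen=...) also evicts the oldest entries once capacity is hit.
        self.buffer = deque(maxlen=capacity)
        self.max_staleness = max_staleness

    def add(self, policy_version: int, rollout: str) -> None:
        self.buffer.append((policy_version, rollout))

    def evict_stale(self, current_version: int) -> None:
        # Drop rollouts generated too many policy updates ago.
        self.buffer = deque(
            ((v, r) for v, r in self.buffer
             if current_version - v <= self.max_staleness),
            maxlen=self.buffer.maxlen,
        )

    def sample(self, k: int):
        # Uniformly sample up to k stored rollouts for reuse in a gradient step,
        # amortizing the generation cost over multiple updates.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))
```

In practice, a training loop under these assumptions would mix a few fresh on-policy generations with samples drawn from the buffer at each step, trading some off-policy bias for a large reduction in generation compute.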
April 15, 2026