
Efficient RL Training for LLMs with Experience Replay

April 9, 2026
Authors: Charles Arnal, Vivien Cabannes, Taco Cohen, Julia Kempe, Remi Munos
cs.AI

Abstract

While Experience Replay (the practice of storing rollouts and reusing them multiple times during training) is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading, and in some cases even improving, final model performance, while preserving policy entropy.
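The mechanism the abstract describes, storing rollouts and reusing them while bounding how stale they may become relative to the current policy, can be sketched as a simple bounded buffer. This is a minimal illustrative sketch, not the authors' implementation; all names (`ReplayBuffer`, `max_staleness`, `policy_version`) are assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    """One stored trajectory with the reward it earned and the
    version of the policy that generated it (illustrative fields)."""
    prompt: str
    response: str
    reward: float
    policy_version: int


class ReplayBuffer:
    """Toy replay buffer with a staleness cap, sketching the trade-off
    in the abstract: reusing old rollouts saves generation compute,
    but rollouts too far behind the current policy add variance."""

    def __init__(self, capacity: int, max_staleness: int):
        self.capacity = capacity
        self.max_staleness = max_staleness
        self.buffer: list[Rollout] = []

    def add(self, rollout: Rollout) -> None:
        # Evict the oldest entry when the buffer is full (FIFO).
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(rollout)

    def evict_stale(self, current_version: int) -> None:
        # Drop rollouts generated more than `max_staleness`
        # policy updates ago.
        self.buffer = [
            r for r in self.buffer
            if current_version - r.policy_version <= self.max_staleness
        ]

    def sample(self, batch_size: int) -> list[Rollout]:
        # Uniform sampling over retained rollouts; the paper studies
        # how such reuse interacts with on-policy training.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

A real system would combine this with off-policy corrections (e.g. importance weighting) when training on stale samples; the sketch only shows the storage and eviction side of the trade-off.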