Efficient RL Training for LLMs with Experience Replay

Abstract

Experience replay, the practice of storing rollouts and reusing them multiple times during training, is a foundational technique in general RL. It nevertheless remains largely unexplored in LLM post-training because of the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we demonstrate that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance (and in some cases while improving it), all while preserving policy entropy.
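To make the idea of storing rollouts and reusing them across training steps concrete, the following is a minimal sketch of a replay buffer with a staleness cutoff. It is not the design studied in the paper, which the abstract does not specify. The names `Rollout`, `ReplayBuffer`, `capacity`, and `max_staleness` are hypothetical, and FIFO eviction with uniform sampling is an assumption chosen only for illustration.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    """One generated sample (all field names are illustrative assumptions)."""
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step of the policy that generated it


class ReplayBuffer:
    """FIFO buffer that drops rollouts older than `max_staleness` policy updates."""

    def __init__(self, capacity: int, max_staleness: int):
        self.items = deque(maxlen=capacity)  # oldest entries fall off first
        self.max_staleness = max_staleness

    def add(self, rollouts):
        self.items.extend(rollouts)

    def sample(self, batch_size: int, current_version: int, rng=random):
        # Evict rollouts generated by a policy that is too far behind.
        fresh = [r for r in self.items
                 if current_version - r.policy_version <= self.max_staleness]
        self.items = deque(fresh, maxlen=self.items.maxlen)
        # Sample without replacement within a batch; rollouts stay in the
        # buffer afterwards, so later batches can reuse them.
        return rng.sample(fresh, min(batch_size, len(fresh)))


buf = ReplayBuffer(capacity=4, max_staleness=2)
buf.add([Rollout("p", f"c{i}", 1.0, policy_version=i) for i in range(6)])
batch = buf.sample(batch_size=8, current_version=5)
```

In this sketch, `max_staleness` bounds how far off-policy the reused data can be, while `capacity` and how often new rollouts are added determine how much generation compute is spent per update. These are two knobs through which the trade-off named in the abstract could be explored.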

