Efficient RL Training for LLMs with Experience Replay

Abstract

Experience replay, the practice of storing rollouts and reusing them multiple times during training, is a foundational technique in general RL. It remains largely unexplored in LLM post-training, however, because of the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance, and in some cases even improving it, while preserving policy entropy.
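To make the idea of storing rollouts and reusing them concrete, here is a minimal sketch of a replay buffer in Python. It is an illustration under stated assumptions, not the design studied in the paper: the `Rollout` fields, the `ReplayBuffer` class, and the `capacity`, `max_staleness`, and `max_uses` parameters are all hypothetical names chosen for this example. The staleness bound limits how old a rollout's generating policy may be, and the reuse cap limits how often a single rollout is trained on.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    # Hypothetical record for one generated sample.
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step of the policy that generated it
    uses: int = 0        # how many times it has been sampled for training


class ReplayBuffer:
    """Illustrative FIFO buffer with a staleness bound and a reuse cap."""

    def __init__(self, capacity, max_staleness, max_uses, seed=0):
        self.items = deque(maxlen=capacity)  # oldest rollouts fall off first
        self.max_staleness = max_staleness
        self.max_uses = max_uses
        self.rng = random.Random(seed)

    def add(self, rollout):
        self.items.append(rollout)

    def sample(self, batch_size, current_version):
        # Evict rollouts that are too stale or have been reused too often.
        self.items = deque(
            (r for r in self.items
             if current_version - r.policy_version <= self.max_staleness
             and r.uses < self.max_uses),
            maxlen=self.items.maxlen,
        )
        # Draw without replacement; the batch may be smaller than requested.
        k = min(batch_size, len(self.items))
        batch = self.rng.sample(list(self.items), k)
        for r in batch:
            r.uses += 1
        return batch


# Usage: fill the buffer from five policy versions, then draw a batch.
buf = ReplayBuffer(capacity=4, max_staleness=2, max_uses=2)
for v in range(5):
    buf.add(Rollout(f"p{v}", f"c{v}", reward=float(v % 2), policy_version=v))
batch = buf.sample(batch_size=2, current_version=5)
```

In this sketch, a larger `max_uses` means fewer fresh generations per training step, while a smaller `max_staleness` keeps the sampled data closer to on-policy; those two knobs are one simple way to picture the compute versus staleness trade-off the abstract describes.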
