RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

July 10, 2025
Authors: Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou
cs.AI

Abstract

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
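
The abstract describes the replay mechanism only in prose. The minimal Python sketch below illustrates the core idea of blending replayed verified trajectories with freshly generated rollouts into a single training mini-batch; the pool contents, the `build_mixed_minibatch` helper, and the `replay_ratio` knob are illustrative assumptions, not the paper's actual training code (that is available in the linked repository).

```python
import random

# A minimal sketch of the replay idea from the abstract (not the authors'
# implementation; see https://github.com/Kwai-Klear/RLEP for the real code).
# Phase one: a hypothetical pool of verified trajectories, each stored as a
# (prompt, trajectory, reward) tuple whose final answer passed verification.
experience_pool = [
    ("prompt-1", "verified solution A", 1.0),
    ("prompt-2", "verified solution B", 1.0),
]

def build_mixed_minibatch(fresh_rollouts, pool, replay_ratio=0.25, seed=None):
    """Blend freshly generated rollouts with replayed successes.

    `replay_ratio` is an illustrative knob controlling how many replayed
    examples (relative to the fresh rollouts) are drawn from the pool.
    """
    rng = random.Random(seed)
    n_replay = min(int(len(fresh_rollouts) * replay_ratio), len(pool))
    replayed = rng.sample(pool, k=n_replay)
    minibatch = list(fresh_rollouts) + replayed
    rng.shuffle(minibatch)  # phase two: the policy is optimized on this mix
    return minibatch

if __name__ == "__main__":
    # Six hypothetical fresh rollouts, some correct (reward 1.0), some not.
    fresh = [(f"prompt-{i}", f"new rollout {i}", float(i % 2)) for i in range(3, 9)]
    batch = build_mixed_minibatch(fresh, experience_pool, replay_ratio=0.5, seed=0)
    for prompt, _trajectory, reward in batch:
        print(prompt, reward)
```

The design choice the sketch highlights is that the replayed items are all verified successes, so each mini-batch carries a guaranteed fraction of high-quality reasoning paths alongside the exploratory rollouts.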