
RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

July 10, 2025
作者: Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou
cs.AI

Abstract

Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
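To make the replay idea concrete, the sketch below illustrates one plausible way to build the mixed mini-batches the abstract describes: a fraction of each batch is drawn from a pool of previously verified trajectories and blended with fresh on-policy rollouts. This is a minimal, hypothetical illustration, not the authors' implementation; the function name, `replay_ratio` parameter, and data layout are assumptions.

```python
import random

# Hypothetical sketch of RLEP-style mini-batch construction.
# Phase 1 is assumed to have produced `experience_pool`: a list of
# (prompt, trajectory) pairs whose final answers were verified as correct.

def build_mini_batch(new_rollouts, experience_pool, replay_ratio=0.25):
    """Blend freshly generated rollouts with replayed verified successes.

    new_rollouts:    trajectories sampled from the current policy this step.
    experience_pool: previously collected trajectories with verified answers.
    replay_ratio:    assumed fraction (relative to new rollouts) drawn from the pool.
    """
    n_replay = int(len(new_rollouts) * replay_ratio)
    replayed = random.sample(experience_pool, k=min(n_replay, len(experience_pool)))
    batch = list(new_rollouts) + replayed
    random.shuffle(batch)  # interleave replayed successes with on-policy rollouts
    return batch
```

Under this reading, each policy update sees both exploratory samples and known-good reasoning paths, which is the mechanism the abstract credits for faster convergence and reduced drift from the pretrained weights.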