On-Policy RL with Optimal Reward Baseline
May 29, 2025
Authors: Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
cs.AI
Abstract
Reinforcement learning algorithms are fundamental to aligning large language
models with human preferences and enhancing their reasoning capabilities.
However, current reinforcement learning algorithms often suffer from training
instability due to loose on-policy constraints and computational inefficiency
due to auxiliary models. In this work, we propose On-Policy RL with Optimal
reward baseline (OPO), a novel and simplified reinforcement learning algorithm
designed to address these challenges. OPO emphasizes the importance of exact
on-policy training, which empirically stabilizes the training process and
enhances exploration. Moreover, OPO introduces the optimal reward baseline that
theoretically minimizes gradient variance. We evaluate OPO on mathematical
reasoning benchmarks. The results demonstrate its superior performance and
training stability without additional models or regularization terms.
Furthermore, OPO achieves lower policy shifts and higher output entropy,
encouraging more diverse and less repetitive responses. These results highlight
OPO as a promising direction for stable and effective reinforcement learning in
large language model alignment and reasoning tasks. The implementation is
provided at https://github.com/microsoft/LMOps/tree/main/opo.
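The abstract's "optimal reward baseline that theoretically minimizes gradient variance" refers to the classical result that the variance-minimizing scalar baseline for the REINFORCE estimator is b* = E[||∇θ log πθ||² R] / E[||∇θ log πθ||²]. The sketch below illustrates this idea in PyTorch for a group of responses sampled for one prompt; approximating each response's squared gradient norm by its token length is an assumption of this sketch, and the function names, tensors, and numbers are hypothetical rather than taken from the released implementation.

```python
import torch

def optimal_reward_baseline(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Variance-minimizing scalar baseline for a group of sampled responses.

    Classical result: b* = E[||grad log pi||^2 * R] / E[||grad log pi||^2]
    minimizes the variance of the REINFORCE gradient estimator. Here each
    response's squared gradient norm is approximated by its token length,
    which is an assumption of this sketch, not necessarily the exact
    weighting used in the paper's released code.
    """
    weights = lengths.float()
    return (weights * rewards).sum() / weights.sum()

def group_advantages(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Advantage of each response relative to the weighted baseline."""
    return rewards - optimal_reward_baseline(rewards, lengths)

# Toy usage: four responses sampled for one prompt, with hypothetical
# scalar rewards and token lengths.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
lengths = torch.tensor([120, 300, 80, 200])
print(group_advantages(rewards, lengths))
```

Compared with subtracting a plain group mean, weighting the baseline this way downweights rewards from responses whose gradients contribute less, which is what yields the lower-variance estimator the abstract alludes to; the exact form used by OPO is specified in the paper and the linked repository.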