On-Policy RL with Optimal Reward Baseline
May 29, 2025
Authors: Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
cs.AI
Abstract
Reinforcement learning algorithms are fundamental to aligning large language
models with human preferences and enhancing their reasoning capabilities.
However, current reinforcement learning algorithms often suffer from training
instability due to loose on-policy constraints and computational inefficiency
due to auxiliary models. In this work, we propose On-Policy RL with Optimal
reward baseline (OPO), a novel and simplified reinforcement learning algorithm
designed to address these challenges. OPO emphasizes the importance of exact
on-policy training, which empirically stabilizes the training process and
enhances exploration. Moreover, OPO introduces the optimal reward baseline that
theoretically minimizes gradient variance. We evaluate OPO on mathematical
reasoning benchmarks. The results demonstrate its superior performance and
training stability without additional models or regularization terms.
Furthermore, OPO achieves lower policy shifts and higher output entropy,
encouraging more diverse and less repetitive responses. These results highlight
OPO as a promising direction for stable and effective reinforcement learning in
large language model alignment and reasoning tasks. The implementation is
provided at https://github.com/microsoft/LMOps/tree/main/opo.
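The abstract's "optimal reward baseline that theoretically minimizes gradient variance" refers to the classical result that the variance-minimizing scalar baseline for the REINFORCE estimator is b* = E[||∇θ log πθ||² R] / E[||∇θ log πθ||²]. The sketch below illustrates this idea in PyTorch for a group of responses sampled for one prompt; approximating each response's squared gradient norm by its token length is an assumption of this sketch, and the function names, tensors, and numbers are hypothetical rather than taken from the released implementation.

```python
import torch

def optimal_reward_baseline(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Variance-minimizing scalar baseline for a group of sampled responses.

    Classical result: b* = E[||grad log pi||^2 * R] / E[||grad log pi||^2]
    minimizes the variance of the REINFORCE gradient estimator. Here each
    response's squared gradient norm is approximated by its token length,
    which is an assumption of this sketch, not necessarily the exact
    weighting used in the paper's released code.
    """
    weights = lengths.float()
    return (weights * rewards).sum() / weights.sum()

def group_advantages(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Advantage of each response relative to the weighted baseline."""
    return rewards - optimal_reward_baseline(rewards, lengths)

# Toy usage: four responses sampled for one prompt, with hypothetical
# scalar rewards and token lengths.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
lengths = torch.tensor([120, 300, 80, 200])
print(group_advantages(rewards, lengths))
```

Compared with subtracting a plain group mean, weighting the baseline this way downweights rewards from responses whose gradients contribute less, which is what yields the lower-variance estimator the abstract alludes to; the exact form used by OPO is specified in the paper and the linked repository.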