On-Policy RL with Optimal Reward Baseline

May 29, 2025
Authors: Yaru Hao, Li Dong, Xun Wu, Shaohan Huang, Zewen Chi, Furu Wei
cs.AI

Abstract

Reinforcement learning algorithms are fundamental for aligning large language models with human preferences and for enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints, and from computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves a lower policy shift and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
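
For context, here is a minimal sketch, in standard policy-gradient notation, of what a variance-minimizing reward baseline looks like for a REINFORCE-style estimator; the symbols (\pi_\theta, R, b) are textbook conventions and are not taken verbatim from the paper:

\[
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ \nabla_\theta \log \pi_\theta(y \mid x)\,\big(R(x, y) - b\big) \big],
\qquad
b^{*}
  = \frac{\mathbb{E}\big[\lVert \nabla_\theta \log \pi_\theta(y \mid x) \rVert^{2}\, R(x, y)\big]}
         {\mathbb{E}\big[\lVert \nabla_\theta \log \pi_\theta(y \mid x) \rVert^{2}\big]}.
\]

Subtracting any constant baseline b leaves the gradient estimator unbiased; the classical variance-minimizing choice b^{*} weights each sampled reward by the squared norm of its score function rather than averaging rewards uniformly.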
