On-Policy RL with Optimal Reward Baseline

Abstract

Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability caused by loose on-policy constraints and from computational inefficiency caused by auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline that theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
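The abstract does not spell out the form of the baseline, so the following is only an illustrative sketch of what "a reward baseline that minimizes gradient variance" can mean. It assumes a toy softmax policy over a handful of discrete actions and uses the classical variance-minimizing baseline for score-function gradient estimators, b* = E[||∇ log π||² R] / E[||∇ log π||²]. The practical estimator used in OPO may differ. All function names (`softmax`, `score_sq_norms`, `optimal_baseline`, `gradient_variance`) and the example logits and rewards are hypothetical and are not taken from the OPO implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def score_sq_norms(probs):
    """Squared norm of grad_logits log pi(a) for each action a.

    For a softmax policy the score is onehot(a) - probs, so
    ||score||^2 = 1 - 2 * p_a + sum_j p_j^2.
    """
    s = sum(p * p for p in probs)
    return [1.0 - 2.0 * p + s for p in probs]


def optimal_baseline(probs, rewards):
    """Classical variance-minimizing baseline (assumed form, see lead-in)."""
    w = score_sq_norms(probs)
    num = sum(p * wi * r for p, wi, r in zip(probs, w, rewards))
    den = sum(p * wi for p, wi in zip(probs, w))
    return num / den


def gradient_variance(probs, rewards, b):
    """Trace of the covariance of score(a) * (R(a) - b) with a ~ probs.

    Computed exactly by enumerating all actions of the toy policy.
    """
    k = len(probs)
    mean = [0.0] * k
    second_moment = 0.0
    for a, (p, r) in enumerate(zip(probs, rewards)):
        score = [(1.0 if j == a else 0.0) - probs[j] for j in range(k)]
        for j in range(k):
            mean[j] += p * score[j] * (r - b)
        second_moment += p * sum(x * x for x in score) * (r - b) ** 2
    return second_moment - sum(m * m for m in mean)


# Hypothetical toy problem: four actions with binary rewards.
probs = softmax([2.0, 0.5, -1.0, 0.0])
rewards = [1.0, 0.0, 0.0, 1.0]

b_opt = optimal_baseline(probs, rewards)
b_mean = sum(p * r for p, r in zip(probs, rewards))  # plain expected-reward baseline

var_opt = gradient_variance(probs, rewards, b_opt)
var_mean = gradient_variance(probs, rewards, b_mean)
var_none = gradient_variance(probs, rewards, 0.0)
```

In this toy setting the expected gradient is the same for every constant baseline, because the score has zero mean under the policy. Only the variance changes, and it is a quadratic function of the baseline whose minimum lies at `b_opt`. Comparing `var_opt` with `var_mean` and `var_none` shows how much a given baseline leaves on the table.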
