On-Policy RL with Optimal Reward Baseline

Abstract

Reinforcement learning algorithms are fundamental to aligning large language models with human preferences and to enhancing their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability caused by loose on-policy constraints and from computational inefficiency caused by auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO introduces the optimal reward baseline, which theoretically minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks, and the results demonstrate superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in large language model alignment and reasoning tasks. The implementation is provided at https://github.com/microsoft/LMOps/tree/main/opo.
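The abstract does not spell out the form of the optimal reward baseline. As a rough illustration only, the sketch below uses the classical result for score-function (REINFORCE-style) gradient estimators: the baseline that minimizes the estimator's variance is a reward average weighted by the squared norm of each sample's score. The use of response length as a stand-in for that weight, the sample values, and the function names `optimal_baseline` and `second_moment` are all assumptions made for this example; the linked repository is the authority on what OPO actually computes.

```python
def optimal_baseline(rewards, weights):
    """Weighted-average reward: sum(w_i * r_i) / sum(w_i).

    `weights` stands in for the squared score norm of each sample
    (an assumption of this sketch, not taken from the abstract).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * r for w, r in zip(weights, rewards)) / total


def second_moment(rewards, weights, baseline):
    """Mean of w_i * (r_i - b)^2, the baseline-dependent part of the variance."""
    return sum(w * (r - baseline) ** 2 for w, r in zip(weights, rewards)) / len(rewards)


# Hypothetical batch: binary rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
# Hypothetical proxy weights (here: response lengths in tokens).
lengths = [10.0, 30.0, 20.0, 40.0]

b_opt = optimal_baseline(rewards, lengths)   # weighted-average baseline
b_mean = sum(rewards) / len(rewards)         # plain mean-reward baseline

# Advantages that would multiply each response's log-probability gradient.
advantages = [r - b_opt for r in rewards]

print(b_opt, b_mean)
print(second_moment(rewards, lengths, b_opt), second_moment(rewards, lengths, b_mean))
```

On this toy batch the weighted baseline gives a smaller second moment than the plain mean reward, which is the sense in which such a baseline reduces gradient variance under the stated assumptions.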
