On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
May 23, 2025
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C. Yao
cs.AI
Abstract
Policy gradient algorithms have been successfully applied to enhance the
reasoning capabilities of large language models (LLMs). Despite the widespread
use of Kullback-Leibler (KL) regularization in policy gradient algorithms to
stabilize training, the systematic exploration of how different KL divergence
formulations can be estimated and integrated into surrogate loss functions for
online reinforcement learning (RL) presents a nuanced and systematically
explorable design space. In this paper, we propose regularized policy gradient
(RPG), a systematic framework for deriving and analyzing KL-regularized policy
gradient methods in the online RL setting. We derive policy gradients and
corresponding surrogate loss functions for objectives regularized by both
forward and reverse KL divergences, considering both normalized and
unnormalized policy distributions. Furthermore, we present derivations for
fully differentiable loss functions as well as REINFORCE-style gradient
estimators, accommodating diverse algorithmic needs. We conduct extensive
experiments on RL for LLM reasoning using these methods, showing improved or
competitive results in terms of training stability and performance compared to
strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at
https://github.com/complex-reasoning/RPG.
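To make the kind of objective described above concrete, the sketch below shows a generic KL-regularized, clipped policy-gradient surrogate in PyTorch. It is an illustrative example, not the paper's RPG losses: the tensor names (`logp_new`, `logp_old`, `logp_ref`), the clip range, the `beta` coefficient, and the choice of the k3-style reverse-KL estimator are all assumptions, and the estimators are unbiased only under on-policy sampling. For the exact forward/reverse, normalized/unnormalized formulations derived in the paper, consult the linked repository.

```python
# Illustrative sketch only; not the authors' RPG implementation.
# Assumed inputs: per-token log-probs under the current policy (logp_new),
# the behavior policy that generated the samples (logp_old), and a frozen
# reference policy (logp_ref), plus per-token advantages and a padding mask.
import torch


def kl_regularized_pg_loss(
    logp_new: torch.Tensor,    # log pi_theta(a_t | s_t), requires grad, shape [B, T]
    logp_old: torch.Tensor,    # log-probs under the behavior (sampling) policy
    logp_ref: torch.Tensor,    # log-probs under the frozen reference policy
    advantages: torch.Tensor,  # per-token (or broadcast per-sequence) advantages
    mask: torch.Tensor,        # 1 for response tokens, 0 for padding
    beta: float = 0.01,        # KL penalty coefficient (assumed value)
    clip_eps: float = 0.2,     # PPO-style clip range (assumed value)
    kl_type: str = "reverse",  # "reverse": KL(pi_theta || pi_ref); "forward": KL(pi_ref || pi_theta)
) -> torch.Tensor:
    """Clipped policy-gradient surrogate plus a fully differentiable KL penalty."""
    # Importance-weighted, clipped policy-gradient term.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages)

    log_r = logp_ref - logp_new  # log(pi_ref / pi_theta)
    if kl_type == "reverse":
        # k3-style estimator of KL(pi_theta || pi_ref): exp(log_r) - log_r - 1 >= 0,
        # unbiased when tokens are sampled from pi_theta.
        kl = torch.exp(log_r) - log_r - 1.0
    else:
        # Forward KL(pi_ref || pi_theta) under samples from pi_theta via
        # importance weighting: (pi_ref / pi_theta) * log(pi_ref / pi_theta).
        kl = torch.exp(log_r) * log_r

    mask = mask.to(pg_loss.dtype)
    loss = (pg_loss + beta * kl) * mask
    return loss.sum() / mask.sum().clamp(min=1.0)
```

Adding the estimated KL term directly to the loss, as above, corresponds to the "fully differentiable" style of regularization mentioned in the abstract; a REINFORCE-style variant would instead propagate the KL signal through the log-probability gradient rather than differentiating the estimator itself. How these choices interact with forward versus reverse KL and with unnormalized policy distributions is precisely the design space the paper maps out.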