On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

May 23, 2025
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C. Yao
cs.AI

Abstract

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) remains a nuanced design space that has not been systematically explored. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
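The KL-regularized objective referred to here typically has the form J(theta) = E[r(x, y)] - beta * KL(pi_theta || pi_ref). As a concrete illustration of one point in this design space, below is a minimal PyTorch-style sketch of a REINFORCE-style surrogate loss with group-normalized advantages and a reverse-KL penalty against a reference policy, using the common k3-style estimator. The function name, tensor shapes, and the particular estimator are illustrative assumptions for exposition, not the paper's exact RPG loss definitions; see the linked repository for those.

```python
# Minimal sketch of a KL-regularized policy-gradient surrogate loss (PyTorch).
# Illustrative only: the group-normalized advantages and the k3-style reverse-KL
# estimator are common choices, not necessarily the paper's RPG formulation.
import torch

def kl_regularized_pg_loss(logp_new, logp_old, logp_ref, rewards, beta=0.05):
    """Importance-weighted policy-gradient term plus a reverse-KL penalty
    approximating KL(pi_theta || pi_ref), computed per sampled response."""
    # Group-normalized advantages (a GRPO-style baseline), no gradient needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current policy and the sampling (old) policy.
    ratio = torch.exp(logp_new - logp_old.detach())

    # Policy-gradient term (maximize reward -> minimize its negative).
    pg_term = -(ratio * adv)

    # k3-style estimator of reverse KL(pi_theta || pi_ref):
    # with r = log pi_ref - log pi_theta, the estimate is exp(r) - r - 1,
    # which is non-negative and low-variance for samples from the policy.
    log_ratio_ref = logp_ref.detach() - logp_new
    kl_term = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    return (pg_term + beta * kl_term).mean()

# Toy usage: per-sequence log-probabilities for a group of 4 sampled responses.
logp_new = torch.randn(4, requires_grad=True)
logp_old = logp_new.detach() + 0.01 * torch.randn(4)
logp_ref = logp_new.detach() + 0.05 * torch.randn(4)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = kl_regularized_pg_loss(logp_new, logp_old, logp_ref, rewards)
loss.backward()
print(loss.item())
```

Because the KL term above is written directly in the loss, the whole expression is fully differentiable; a REINFORCE-style variant would instead treat the KL estimate as part of the (detached) reward signal, which is the kind of design choice the paper's framework makes explicit.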