On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
May 23, 2025
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C. Yao
cs.AI
Abstract
Policy gradient algorithms have been successfully applied to enhance the
reasoning capabilities of large language models (LLMs). Despite the widespread
use of Kullback-Leibler (KL) regularization in policy gradient algorithms to
stabilize training, the systematic exploration of how different KL divergence
formulations can be estimated and integrated into surrogate loss functions for
online reinforcement learning (RL) presents a nuanced and systematically
explorable design space. In this paper, we propose regularized policy gradient
(RPG), a systematic framework for deriving and analyzing KL-regularized policy
gradient methods in the online RL setting. We derive policy gradients and
corresponding surrogate loss functions for objectives regularized by both
forward and reverse KL divergences, considering both normalized and
unnormalized policy distributions. Furthermore, we present derivations for
fully differentiable loss functions as well as REINFORCE-style gradient
estimators, accommodating diverse algorithmic needs. We conduct extensive
experiments on RL for LLM reasoning using these methods, showing improved or
competitive results in terms of training stability and performance compared to
strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at
https://github.com/complex-reasoning/RPG.
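To make the kind of objective described above concrete, the sketch below shows a generic KL-regularized, clipped policy-gradient surrogate in PyTorch. It is an illustrative example, not the paper's RPG losses: the tensor names (`logp_new`, `logp_old`, `logp_ref`), the clip range, the `beta` coefficient, and the choice of the k3-style reverse-KL estimator are all assumptions, and the estimators are unbiased only under on-policy sampling. For the exact forward/reverse, normalized/unnormalized formulations derived in the paper, consult the linked repository.

```python
# Illustrative sketch only; not the authors' RPG implementation.
# Assumed inputs: per-token log-probs under the current policy (logp_new),
# the behavior policy that generated the samples (logp_old), and a frozen
# reference policy (logp_ref), plus per-token advantages and a padding mask.
import torch


def kl_regularized_pg_loss(
    logp_new: torch.Tensor,    # log pi_theta(a_t | s_t), requires grad, shape [B, T]
    logp_old: torch.Tensor,    # log-probs under the behavior (sampling) policy
    logp_ref: torch.Tensor,    # log-probs under the frozen reference policy
    advantages: torch.Tensor,  # per-token (or broadcast per-sequence) advantages
    mask: torch.Tensor,        # 1 for response tokens, 0 for padding
    beta: float = 0.01,        # KL penalty coefficient (assumed value)
    clip_eps: float = 0.2,     # PPO-style clip range (assumed value)
    kl_type: str = "reverse",  # "reverse": KL(pi_theta || pi_ref); "forward": KL(pi_ref || pi_theta)
) -> torch.Tensor:
    """Clipped policy-gradient surrogate plus a fully differentiable KL penalty."""
    # Importance-weighted, clipped policy-gradient term.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages)

    log_r = logp_ref - logp_new  # log(pi_ref / pi_theta)
    if kl_type == "reverse":
        # k3-style estimator of KL(pi_theta || pi_ref): exp(log_r) - log_r - 1 >= 0,
        # unbiased when tokens are sampled from pi_theta.
        kl = torch.exp(log_r) - log_r - 1.0
    else:
        # Forward KL(pi_ref || pi_theta) under samples from pi_theta via
        # importance weighting: (pi_ref / pi_theta) * log(pi_ref / pi_theta).
        kl = torch.exp(log_r) * log_r

    mask = mask.to(pg_loss.dtype)
    loss = (pg_loss + beta * kl) * mask
    return loss.sum() / mask.sum().clamp(min=1.0)
```

Adding the estimated KL term directly to the loss, as above, corresponds to the "fully differentiable" style of regularization mentioned in the abstract; a REINFORCE-style variant would instead propagate the KL signal through the log-probability gradient rather than differentiating the estimator itself. How these choices interact with forward versus reverse KL and with unnormalized policy distributions is precisely the design space the paper maps out.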