Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
February 5, 2026
Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
cs.AI
Abstract
Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL–χ² regularizer. This additional χ² regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
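To make the abstract's quantities concrete, here is a minimal sketch on a toy single-state bandit (not the paper's released code; the function names, learning rate, and toy rewards are illustrative assumptions). It contrasts the ideal closed-form PMD step, which needs the partition function Z_t = E_{a~π_t}[exp(r(a)/β)], with the PMD-mean surrogate described in the abstract, which replaces the log-partition term with the mean reward under the sampling policy and fits a softmax policy by squared regression in log-policy space.

```python
import numpy as np

def exact_pmd_update(pi_t, r, beta):
    """Ideal closed-form PMD step: pi_{t+1}(a) ∝ pi_t(a) exp(r(a)/beta).
    Requires the partition function Z = sum_a pi_t(a) exp(r(a)/beta)."""
    w = pi_t * np.exp(r / beta)
    return w / w.sum()

def pmd_mean_target(pi_t, r, beta):
    """Regression target in log-policy space: the log-partition term is
    approximated by the mean reward under the sampling policy (scaled by 1/beta)."""
    mean_r = np.dot(pi_t, r)                      # E_{a ~ pi_t}[r(a)]
    return np.log(pi_t) + (r - mean_r) / beta

def fit_pmd_mean(pi_t, r, beta, lr=0.1, steps=5000):
    """Minimize E_{a ~ pi_t}[(log pi_theta(a) - target(a))^2] over a
    softmax-parameterized policy by gradient descent (toy, tabular)."""
    t = pmd_mean_target(pi_t, r, beta)
    theta = np.log(pi_t).copy()                   # start from the sampling policy
    for _ in range(steps):
        log_pi = theta - np.log(np.exp(theta).sum())
        pi = np.exp(log_pi)
        resid = log_pi - t
        # d log pi(a) / d theta(b) = 1{a=b} - pi(b)
        grad = 2 * (pi_t * resid - pi * np.dot(pi_t, resid))
        theta -= lr * grad
    return np.exp(theta - np.log(np.exp(theta).sum()))

# Toy "prompt" with 4 candidate responses; in the paper's setting the policy
# is an LLM over token sequences and expectations come from sampled rollouts.
pi_t = np.full(4, 0.25)
r = np.array([1.0, 0.2, -0.5, -1.2])
beta = 1.0

print("exact PMD  :", np.round(exact_pmd_update(pi_t, r, beta), 4))
print("PMD-mean   :", np.round(fit_pmd_mean(pi_t, r, beta), 4))
```

The two updates generally differ: the abstract interprets the gap as an adaptive χ² term on top of the KL term, which keeps the PMD-mean solution closer to the sampling policy when expected rewards are low.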