Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Abstract

Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL–χ^2 regularizer. This additional χ^2 regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
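
To make the approximation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small discrete action space, a temperature `tau` for the KL-regularized subproblem, and a squared-error regression on the log-ratio log(π_new/π_old); the function names and these modeling choices are illustrative assumptions and do not come from the abstract or the linked repository. It contrasts the log-ratio target of the exact closed-form update, which needs the log-partition term, with the target obtained when that term is replaced by the mean reward under the sampling policy.

```python
import numpy as np


def exact_pmd_target(rewards, old_probs, tau):
    """Log-ratio target of the closed-form KL-regularized update (assumed form).

    pi_new(y) is proportional to pi_old(y) * exp(r(y) / tau), so
    log(pi_new / pi_old) = r / tau - log Z with Z = E_{pi_old}[exp(r / tau)].
    Note: exp(r / tau) can overflow for very small tau; this sketch ignores that.
    """
    log_z = np.log(np.sum(old_probs * np.exp(rewards / tau)))
    return rewards / tau - log_z


def pmd_mean_target(rewards, old_probs, tau):
    """Same target with tau * log Z replaced by the mean reward under pi_old."""
    mean_reward = np.sum(old_probs * rewards)
    return (rewards - mean_reward) / tau


def pmd_mean_loss(log_ratio, sampled_rewards, tau):
    """Squared-error regression in log-policy space on sampled rollouts.

    The mean reward is estimated from the samples themselves (an assumption).
    """
    target = (sampled_rewards - sampled_rewards.mean()) / tau
    return np.mean((log_ratio - target) ** 2)


if __name__ == "__main__":
    rewards = np.array([1.0, 0.0, 0.0, 1.0])    # hypothetical binary rewards
    old_probs = np.array([0.1, 0.4, 0.3, 0.2])  # hypothetical sampling policy
    tau = 0.5
    print("exact target:   ", exact_pmd_target(rewards, old_probs, tau))
    print("PMD-mean target:", pmd_mean_target(rewards, old_probs, tau))
```

By Jensen's inequality the mean reward never exceeds tau * log Z, so in this sketch the PMD-mean target is at least as large as the exact target for every action, and exponentiating it does not in general give a normalized distribution. The abstract's characterization of the population solution describes what the regression actually converges to; this sketch only shows the targets.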
