Target Policy Optimization

Abstract

In reinforcement learning (RL), given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i ∝ p_i^{old} exp(u_i) and fits the policy to it by cross-entropy. The loss gradient with respect to the sampled-completion logits is p^θ - q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
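
The target construction and the gradient claim can be checked numerically on a toy group of completions. The sketch below is an illustration under stated assumptions, not the released implementation: each sampled completion is treated as one category with its own logit, the values u_i are taken as given inputs (the abstract does not say how they are derived from raw scores), and the function names softmax, tpo_target, cross_entropy, and ce_grad are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def tpo_target(p_old, u):
    """Target q_i proportional to p_old_i * exp(u_i), normalized over the group."""
    log_w = [math.log(p) + ui for p, ui in zip(p_old, u)]
    return softmax(log_w)


def cross_entropy(q, logits):
    """Cross-entropy H(q, p) with p = softmax(logits)."""
    p = softmax(logits)
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))


def ce_grad(q, logits):
    """Analytic gradient of cross_entropy w.r.t. the logits: p - q."""
    p = softmax(logits)
    return [pi - qi for pi, qi in zip(p, q)]


# Toy group of four completions (all numbers are made up for illustration).
old_logits = [0.0, 0.5, -0.3, 1.0]
p_old = softmax(old_logits)
u = [1.0, -1.0, 0.0, 0.5]  # assumed per-completion utilities

q = tpo_target(p_old, u)

# Gradient at the old policy: nonzero, pushes mass toward high-u completions.
print("grad at old policy:", ce_grad(q, old_logits))

# Gradient once the policy equals the target: zero up to rounding.
target_logits = [math.log(qi) for qi in q]
print("grad at target:    ", ce_grad(q, target_logits))
```

Because q is fixed before fitting, the cross-entropy loss has a stationary point exactly where the policy equals q, which is what separates "which completions gain mass" from "how far the parameters move" in this toy setting.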