

Target Policy Optimization

April 7, 2026
Author: Jean Kaddour
cs.AI

Abstract

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i ∝ p_i^{old} exp(u_i) and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is p^θ − q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
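The construction described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: it assumes `u` is a score-derived utility per sampled completion (how the paper maps rewards to u_i is not specified here), builds q_i ∝ p_i^{old} exp(u_i) over the sampled group, and shows that the cross-entropy gradient on the logits is p^θ − q.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tpo_target(p_old, u):
    # target distribution q_i ∝ p_i^{old} * exp(u_i),
    # normalized over the sampled group (u shifted for stability)
    w = p_old * np.exp(u - u.max())
    return w / w.sum()

def tpo_loss_and_grad(logits, q):
    # cross-entropy of the policy p^θ = softmax(logits) against q;
    # its gradient w.r.t. the logits is p^θ − q, which vanishes
    # once the policy matches the target
    p = softmax(logits)
    loss = -np.sum(q * np.log(p + 1e-12))
    grad = p - q
    return loss, grad
```

For example, if the logits are set to log q, the policy already equals the target and the gradient is (numerically) zero, matching the fixed-point property stated in the abstract.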