Target Policy Optimization
April 7, 2026
Author: Jean Kaddour
cs.AI
Abstract
In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution q_i ∝ p_i^{old} exp(u_i) and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is p^θ − q, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR tasks, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
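The two quantities the abstract names can be sketched directly: the target q_i ∝ p_i^{old} exp(u_i), and the cross-entropy gradient on sampled-completion logits, p^θ − q. The sketch below is a minimal NumPy illustration of those two formulas only, not the released implementation; the score vector `u` (e.g. group-normalized rewards) and the restriction to the sampled group are assumptions.

```python
import numpy as np

def tpo_target(logp_old, u):
    """Target q_i ∝ p_i^{old} * exp(u_i), computed in log space for stability."""
    logits = np.asarray(logp_old) + np.asarray(u)
    logits = logits - logits.max()          # shift-invariant; avoids overflow
    q = np.exp(logits)
    return q / q.sum()

def tpo_logit_grad(logits_theta, q):
    """Gradient of the cross-entropy loss -sum_i q_i log p^θ_i
    with respect to the sampled-completion logits: p^θ - q."""
    z = np.asarray(logits_theta) - np.max(logits_theta)
    p = np.exp(z) / np.exp(z).sum()         # softmax over the sampled group
    return p - q

# Hypothetical group of 3 scored completions under the old policy.
logp_old = np.log(np.array([0.5, 0.3, 0.2]))
u = np.array([1.0, 0.0, -1.0])              # assumed per-completion scores
q = tpo_target(logp_old, u)

# Once the policy matches the target (logits = log q), the gradient vanishes.
grad = tpo_logit_grad(np.log(q), q)
```

The log-space shift before exponentiation leaves q unchanged (softmax is shift-invariant) while keeping the computation numerically safe for large scores.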