Target Policy Optimization

Abstract

In reinforcement learning (RL), given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\text{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The gradient of the loss with respect to the sampled-completion logits is $p^\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
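
The two steps described above, building the target and then fitting it by cross-entropy, can be illustrated with a minimal tabular sketch. This is not the code from the linked repository. It assumes the policy is a softmax over one free logit per sampled completion, that `u` holds hypothetical standardized scores, and that plain gradient descent with an arbitrary step size is used for the fit. The helper names (`tpo_target`, `cross_entropy_grad`) are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tpo_target(p_old, u):
    """Target q_i proportional to p_old_i * exp(u_i), normalized in log space."""
    log_weights = [math.log(p) + ui for p, ui in zip(p_old, u)]
    return softmax(log_weights)


def cross_entropy_grad(logits, q):
    """Gradient of -sum_i q_i * log softmax(logits)_i w.r.t. logits: p - q."""
    p = softmax(logits)
    return [pi - qi for pi, qi in zip(p, q)]


# Assumption: four sampled completions, uniform old policy over the group.
logits = [0.0, 0.0, 0.0, 0.0]
p_old = softmax(logits)
u = [1.0, 0.0, 0.0, -1.0]  # hypothetical scores, not taken from the paper

# Step 1: decide which completions should gain probability mass.
q = tpo_target(p_old, u)

# Step 2: move the parameters toward the fixed target by cross-entropy.
step_size = 1.0  # arbitrary choice for this sketch
for _ in range(1000):
    grad = cross_entropy_grad(logits, q)
    logits = [l - step_size * g for l, g in zip(logits, grad)]

p_new = softmax(logits)
final_grad = cross_entropy_grad(logits, q)
```

Because the target `q` is held fixed during the fit, the gradient `p - q` shrinks toward zero as the policy approaches the target, so extra optimization steps stop changing the policy instead of pushing it further.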