Target Policy Optimization

Abstract

In reinforcement learning (RL), given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\text{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The gradient of the loss with respect to the sampled-completion logits is $p^\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
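
The target construction and the gradient identity stated in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal pure-Python toy under stated assumptions: the policy is a softmax over one logit per sampled completion, the scores `utilities` are made-up values standing in for u_i, and the helper names `tpo_target`, `cross_entropy`, and `cross_entropy_grad` are hypothetical. Because softmax is shift-invariant, q_i ∝ p_i^old exp(u_i) is the same as a softmax over the old logits plus u_i, and the cross-entropy gradient with respect to the logits comes out as p^θ - q.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tpo_target(old_logits, utilities):
    """Target q_i proportional to p_i^old * exp(u_i).

    Equivalent to softmax(old_logits + utilities), since
    p_i^old is itself a softmax of old_logits.
    """
    return softmax([l + u for l, u in zip(old_logits, utilities)])


def cross_entropy(q, logits):
    """Cross-entropy H(q, p_theta) with p_theta = softmax(logits)."""
    p = softmax(logits)
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p))


def cross_entropy_grad(q, logits):
    """Gradient of H(q, p_theta) w.r.t. the logits: p_theta - q."""
    p = softmax(logits)
    return [pi - qi for pi, qi in zip(p, q)]


# Toy group of four sampled completions (all values are illustrative).
old_logits = [0.0, 0.0, 0.0, 0.0]   # uniform old policy over the group
utilities = [1.0, 0.0, 0.0, -1.0]   # assumed scores u_i
q = tpo_target(old_logits, utilities)

# Plain gradient descent on the logits; step size and step count are
# arbitrary choices for this toy problem.
logits = list(old_logits)
for _ in range(500):
    grad = cross_entropy_grad(q, logits)
    logits = [l - 0.5 * g for l, g in zip(logits, grad)]

p_final = softmax(logits)
max_gap = max(abs(a - b) for a, b in zip(p_final, q))
print("target q:", [round(x, 4) for x in q])
print("max |p - q| after fitting:", max_gap)
```

In this toy setting the fixed point of the update is the target itself: once `p_final` equals `q` the gradient is zero, so further steps leave the policy unchanged regardless of the step size. With a real model the logits are functions of shared parameters, so this sketch only shows the behavior at the level of the sampled-completion logits.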