
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

May 7, 2026
Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capability. Among existing recipes, group-based policy gradient methods are prevalent: they sample a group of responses per prompt and update the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via a first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO), which performs this target projection explicitly: it recovers the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy onto it via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility, through the decoupled projection step, to select divergences with distinct structural properties. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
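To make the target-projection view concrete, below is a minimal sketch of one listwise projection step for a single prompt, based only on the description in the abstract. The function name `lpo_projection_loss`, the temperature `beta`, the exponentiated-advantage form of the target, and the choice of forward KL as the divergence are illustrative assumptions, not the paper's exact formulation; the decoupled projection step described above would admit other divergences.

```python
# Illustrative sketch of a listwise target-projection step, assuming an
# exponentiated-advantage target and forward KL as the divergence; the
# paper's exact target construction and divergence are not reproduced here.
import torch
import torch.nn.functional as F

def lpo_projection_loss(logp_new, logp_old, rewards, beta=1.0):
    """Listwise projection loss for one prompt with G sampled responses.

    logp_new : (G,) log-probabilities of the sampled responses under the
               current policy (requires grad).
    logp_old : (G,) log-probabilities under the sampling (old) policy.
    rewards  : (G,) verifiable rewards, e.g. 0/1 correctness.
    beta     : assumed temperature controlling how sharply the target
               tilts toward high-advantage responses.
    """
    # Group-relative advantages, as in group-based RLVR recipes.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Assumed explicit target on the response simplex, restricted to the
    # sampled group: the old policy reweighted by exponentiated advantages.
    target = F.softmax(logp_old + adv / beta, dim=-1).detach()

    # Current policy renormalized over the same group of responses.
    policy_log = F.log_softmax(logp_new, dim=-1)

    # Projection by divergence minimization; forward KL(target || policy)
    # is one choice among those the decoupled step would allow.
    return F.kl_div(policy_log, target, reduction="sum")

# Toy usage with fabricated per-response log-probabilities and rewards.
if __name__ == "__main__":
    logp_new = torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True)
    logp_old = torch.tensor([-12.5, -14.0, -11.5, -13.5])
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
    loss = lpo_projection_loss(logp_new, logp_old, rewards)
    loss.backward()
    print(loss.item(), logp_new.grad)
```

Note that the gradient of this loss with respect to the group log-probabilities is the difference between the renormalized policy and the target, which sums to zero over the group; this is consistent with, though not a proof of, the bounded and zero-sum projection-gradient properties claimed in the abstract.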