Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
May 7, 2026
Authors: Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient methods are prevalent: they sample a group of responses per prompt and update the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via a first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO), which performs the target projection explicitly: it demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy onto that target via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective, with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection, with distinct structural properties arising from the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
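To make the group-based RLVR and target-projection idea concrete, the sketch below illustrates one plausible instantiation under stated assumptions; the abstract does not specify the exact target distribution or divergence, so the softmax-of-advantages target on the sampled group, the KL projection, the `beta` temperature, and the function names `group_relative_advantages` and `lpo_projection_loss` are all illustrative choices, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Center (and scale) each response's verifiable reward by the
    statistics of its own group, as in standard group-based RLVR."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

def lpo_projection_loss(logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        beta: float = 1.0) -> torch.Tensor:
    """Listwise target-projection sketch (assumed target and divergence).

    logprobs:   (B, G) sequence log-probabilities of the G sampled
                responses under the current policy.
    advantages: (B, G) group-relative advantages, treated as fixed signals.
    beta:       hypothetical temperature controlling how sharply the target
                concentrates on high-advantage responses.
    """
    # Policy distribution restricted (renormalized) to the sampled group.
    policy_on_group = F.log_softmax(logprobs, dim=-1)
    # Explicit listwise target on the group simplex: one assumed choice,
    # a softmax of the group-relative advantages.
    target = F.softmax(beta * advantages, dim=-1).detach()
    # Project the policy onto the target by exact KL minimization.
    return F.kl_div(policy_on_group, target, reduction="batchmean")

# Toy usage: 2 prompts, 4 sampled responses each, binary verifiable rewards.
logprobs = torch.randn(2, 4, requires_grad=True)
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
loss = lpo_projection_loss(logprobs, group_relative_advantages(rewards))
loss.backward()
```

In this sketch, the listwise structure comes from renormalizing the policy over the sampled group and matching it to an explicit distribution on that restricted simplex, rather than weighting per-response log-probabilities by advantages as in a standard group-based policy gradient update; swapping the KL term for another divergence would change the structural properties of the projection while keeping the same target.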