A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how an update affects next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the sign of the advantage and on the token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Code is available at https://github.com/layer6ai-labs/wapo.
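
To make the idea of updating only on positive-advantage completions concrete, the following is a minimal sketch rather than the authors' implementation; the linked repository is the reference for the actual objective. It assumes GRPO-style advantages standardized within a group of completions, a PPO-style clipping range of 0.2, and averaging over tokens. The function names `group_advantages` and `winner_only_clipped_loss` are hypothetical, and none of these choices is stated in the abstract.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of completions.

    Assumption: GRPO-style normalization (subtract group mean, divide by group std).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def winner_only_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss over tokens of positive-advantage completions only.

    logp_new, logp_old: one list of per-token log-probabilities per completion.
    advantages: one scalar per completion.
    Assumption: completions with advantage <= 0 are skipped entirely.
    """
    total, count = 0.0, 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        if adv <= 0:
            continue  # non-winning completions contribute nothing to the loss
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)
            clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
            # PPO-style pessimistic bound on the surrogate objective
            total += min(ratio * adv, clipped * adv)
            count += 1
    # Negate because optimizers minimize; return 0.0 if the group has no winners
    return -total / count if count else 0.0


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # verifiable 0/1 rewards for four completions
    adv = group_advantages(rewards)
    logp = [[-1.0, -2.0]] * 4  # toy per-token log-probs, identical old and new
    print(winner_only_clipped_loss(logp, logp, adv))
```

In this sketch a group in which no completion has positive advantage yields a loss of zero, so that prompt produces no update; whether WAPO handles such groups this way is not specified in the abstract.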