A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyze this instability through token-level gradient dynamics and derive a taxonomy that predicts how an update changes next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the sign of the advantage and on the token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code is available at https://github.com/layer6ai-labs/wapo.
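
To make the idea of updating only on positive-advantage completions concrete, the following is a minimal sketch in plain Python. It is not the objective from the paper or the linked repository. The group-normalized advantage, the PPO-style clipped ratio, the token-mean reduction, and the names group_advantages, winner_only_clipped_loss, and clip_eps are all assumptions chosen for illustration.

```python
import math


def group_advantages(rewards):
    """Group-normalized advantages, (r - mean) / std (assumed GRPO-style form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # All completions scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def winner_only_clipped_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss restricted to positive-advantage completions.

    logp_new, logp_old: one list of per-token log-probabilities per completion,
    under the current policy and the sampling policy respectively.
    rewards: one scalar verifiable reward per completion in the group.
    """
    advantages = group_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        if adv <= 0.0:
            continue  # Assumed masking rule: non-winners contribute no gradient.
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Negate because optimizers minimize; average over the winners' tokens.
    return -total / n_tokens if n_tokens else 0.0


if __name__ == "__main__":
    # Hypothetical group of four completions with binary rewards.
    logp = [[-0.5, -0.2], [-1.0], [-0.3], [-0.1]]
    print(winner_only_clipped_loss(logp, logp, [1, 0, 0, 1]))
```

In this sketch, completions with zero or negative advantage are skipped entirely, so changing their log-probabilities leaves the loss unchanged. A group whose completions all receive the same reward produces a loss of zero.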