Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
December 25, 2025
Authors: Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou
cs.AI
Abstract
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable rewards (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples, at both the sample level and the token level, affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, termed A3PO, which more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
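To make the notion of asymmetric token-level advantage shaping concrete, below is a minimal sketch, not the paper's A3PO algorithm, whose exact shaping rule is not given in this abstract. It assumes a GRPO-style group-normalized sample-level advantage and then illustrates one hypothetical way to concentrate the signal on a subset of "key" tokens, treating positive and negative rollouts asymmetrically. The function name `shaped_token_advantages`, the top-fraction heuristic, and the `pos_scale`/`neg_scale` knobs are all illustrative assumptions.

```python
import torch

def shaped_token_advantages(rewards, token_logprobs,
                            pos_scale=1.0, neg_scale=1.0, top_frac=0.2):
    """Hypothetical illustration of asymmetric token-level advantage shaping.

    rewards:        (G,) tensor of verifiable 0/1 rewards for G rollouts of one prompt
    token_logprobs: list of G tensors, each (T_i,) of per-token log-probabilities
    Returns a list of G tensors of per-token advantages.
    """
    # Sample-level advantage via group normalization (as in group-relative baselines).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    token_advs = []
    for a, lp in zip(adv, token_logprobs):
        k = max(1, int(top_frac * lp.numel()))
        weights = torch.zeros_like(lp)
        if a >= 0:
            # Positive rollout: assumed heuristic, weight high-confidence tokens
            # to sharpen existing correct reasoning patterns.
            idx = torch.topk(lp, k).indices
            weights[idx] = pos_scale
        else:
            # Negative rollout: assumed heuristic, weight low-confidence tokens
            # to push exploration away from uncertain steps.
            idx = torch.topk(-lp, k).indices
            weights[idx] = neg_scale
        token_advs.append(a * weights)
    return token_advs
```

In this sketch, uniform per-token advantages (as in standard GRPO) are recovered by setting `top_frac=1.0` and `pos_scale=neg_scale=1.0`; any adaptive choice of these knobs per rollout is left unspecified here.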