Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
May 14, 2026
Authors: Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
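The pass@K claim above is easier to read with the metric written out. The sketch below uses the standard unbiased pass@k estimator from the code-generation evaluation literature. The abstract does not say which estimator or sampling budget the authors used, so the function name `pass_at_k` and the sample counts are illustrative assumptions, not the paper's evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled answers, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct answer.
    Assumption: this is the common unbiased estimator, not necessarily
    the exact protocol used in the paper.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 16 samples for one problem, 2 of them correct.
    for k in (1, 4, 8):
        print(f"pass@{k} = {pass_at_k(16, 2, k):.4f}")
```

Under this estimator, a problem the model never solves in any of its n samples contributes 0 at every k. A method that only sharpens the distribution toward answers the model could already reach tends to raise pass@1 while leaving pass@K at large K roughly flat, which is why gains at larger K are read as evidence of newly solvable problems.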