Learning from Failure: Correction-Oriented Policy Optimization with Verifiable Rewards

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
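
To make the pass@K claim concrete, the following is a minimal sketch of the widely used unbiased pass@k estimator, which computes 1 - C(n-c, k) / C(n, k) from n sampled solutions of which c pass verification. The abstract does not state which estimator the evaluation uses, so applying this particular formula here is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of sampled solutions
    c: number of those samples that pass verification
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A problem the model never solves contributes 0 at every k.
    print(pass_at_k(16, 0, 8))
    # A problem solved once in 16 samples.
    print(pass_at_k(16, 1, 1), pass_at_k(16, 1, 8))
```

Under this estimator, a problem with no correct samples scores 0 for every k, whereas a problem solved even once in n samples scores k/n. Gains at large K therefore require the model to solve problems it previously never solved, which is why stronger pass@K results are read as evidence of improved reasoning capacity rather than probability mass shifting among answers that were already reachable.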