Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, which leave the optimization signal ambiguous and the useful information embedded in failed trajectories underused. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly strengthening the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
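
To make the idea of turning failed on-policy trajectories into correction-oriented supervision more concrete, the following is a minimal sketch of the data-construction step only. It is not the authors' implementation: the abstract does not specify the prompt wording, the sampling scheme, or the loss, so `Rollout`, `CORRECTION_TEMPLATE`, `build_training_batch`, and the `policy` and `verifier` callables are all hypothetical names introduced for illustration. The sketch assumes a binary verifier and assumes that a correction attempt is scored by the same verifier against the original problem.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str     # text the policy was conditioned on
    response: str   # solution sampled from the current policy (on-policy)
    reward: float   # binary verifier outcome: 1.0 = correct, 0.0 = incorrect


# Hypothetical template; the wording is an assumption, not taken from the paper.
CORRECTION_TEMPLATE = (
    "Problem:\n{prompt}\n\n"
    "A previous attempt that was judged incorrect:\n{response}\n\n"
    "Identify the error and give a corrected solution."
)


def build_training_batch(
    rollouts: List[Rollout],
    policy: Callable[[str], str],
    verifier: Callable[[str, str], float],
) -> List[Rollout]:
    """Return the original rollouts plus one correction sample per failure.

    Each failed rollout is wrapped in a correction prompt, the policy is
    sampled again on that prompt, and the new response is scored by the
    same verifier against the ORIGINAL problem, so no external signal
    beyond the verifier is used.
    """
    batch = list(rollouts)  # standard RLVR samples are kept unchanged
    for r in rollouts:
        if r.reward == 0.0:
            correction_prompt = CORRECTION_TEMPLATE.format(
                prompt=r.prompt, response=r.response
            )
            fixed = policy(correction_prompt)
            batch.append(
                Rollout(correction_prompt, fixed, verifier(r.prompt, fixed))
            )
    return batch
```

Under these assumptions, the returned batch could be passed to whatever policy-gradient update the RLVR setup already uses, so that standard and correction samples are optimized jointly. How the two kinds of samples are weighted against each other is left open here because the abstract does not say.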