CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capability of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards and neglects the correctness of intermediate reasoning steps. Training on rollouts whose reasoning process is wrong but whose outcome is correct can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we propose CLIPO, which incorporates a contrastive learning mechanism into policy optimization to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in both generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
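The abstract does not spell out the form of the contrastive loss, so the following is only a minimal sketch under stated assumptions, not the authors' formulation. It assumes each rollout for a prompt has already been mapped to a fixed-size, nonzero embedding, that rewards are binary, and that a supervised-contrastive (InfoNCE-style) objective is used: successful rollouts act as mutual positives and all other rollouts in the group act as the contrast set. The names `contrastive_loss_over_successes`, `embeddings`, `rewards`, and `temperature` are hypothetical.

```python
import numpy as np


def contrastive_loss_over_successes(embeddings, rewards, temperature=0.1):
    """Illustrative SupCon-style loss over one group of rollouts.

    Assumption (not from the paper): `embeddings` is a (G, d) array of
    nonzero rollout representations and `rewards` is a length-G array
    where a positive value marks a verified-correct rollout.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine geometry
    success = np.asarray(rewards) > 0
    n = len(z)

    # A contrastive term needs at least two successful rollouts to pair up.
    if success.sum() < 2:
        return 0.0

    sim = (z @ z.T) / temperature
    off_diag = ~np.eye(n, dtype=bool)

    # Numerically stable log-sum-exp over every *other* rollout in the group.
    sim_masked = np.where(off_diag, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_masked - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Positives: pairs of distinct successful rollouts.
    pos_mask = success[:, None] & success[None, :] & off_diag
    pos_counts = pos_mask.sum(axis=1)[success]
    per_anchor = -(log_prob * pos_mask).sum(axis=1)[success] / pos_counts
    return float(per_anchor.mean())


# Hypothetical usage: four rollouts for one prompt, the first two verified correct.
group_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
group_rewards = np.array([1, 1, 0, 0])
aux_loss = contrastive_loss_over_successes(group_embeddings, group_rewards)
```

In a training loop, a term like this would presumably be added to the usual RLVR policy objective with some weighting coefficient; that weighting and the choice of rollout representation are likewise assumptions here, and the released repository is the authoritative reference.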
