ChatPaper.ai

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

March 10, 2026
Authors: Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
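The abstract does not specify the exact form of the contrastive objective, only that it is optimized over successful rollouts. As an illustrative sketch under that assumption, an InfoNCE-style loss that pulls together embeddings of paired correct rollouts for the same prompt (while pushing apart rollouts from other prompts in the batch) could look like the following; the function name, temperature, and pairing scheme are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of embedding pairs.

    anchors, positives: (N, D) arrays; row i of `positives` is a second
    correct rollout for the same prompt as row i of `anchors`. Rollouts
    from other prompts in the batch serve as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair for each anchor sits on the diagonal.
    return -np.mean(np.diag(log_probs))
```

In an RLVR training loop, such a term would typically be added to the policy-optimization loss with a weighting coefficient, e.g. `total_loss = rlvr_loss + lam * info_nce_loss(h_a, h_p)`, so that representations of correct reasoning paths for the same problem are drawn toward their shared structure.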
PDF · March 13, 2026