CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
March 10, 2026
Authors: Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucinated answers and answer copying, severely undermining the model's generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, yielding uniform gains in generalization and robustness for LLM policy optimization. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
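The abstract does not specify CLIPO's exact objective, only that a contrastive loss is optimized over successful rollouts. The sketch below is therefore a generic supervised-contrastive (InfoNCE-style) loss over pooled rollout embeddings, not the paper's definition: the function name `contrastive_rollout_loss`, the choice of outcome-wrong rollouts as negatives, the pooling into fixed-size embeddings, and the temperature value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_rollout_loss(correct_embs: torch.Tensor,
                             incorrect_embs: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """Hypothetical InfoNCE-style loss over rollout embeddings.

    Pulls representations of outcome-correct rollouts toward each
    other (to capture structure shared across correct reasoning
    paths) and pushes them away from outcome-wrong rollouts.

    correct_embs:   (P, d) pooled embeddings of correct rollouts, P >= 2
    incorrect_embs: (N, d) pooled embeddings of incorrect rollouts
    """
    pos = F.normalize(correct_embs, dim=-1)    # (P, d) unit vectors
    neg = F.normalize(incorrect_embs, dim=-1)  # (N, d) unit vectors

    sim_pp = pos @ pos.t() / temperature       # (P, P) correct-vs-correct
    sim_pn = pos @ neg.t() / temperature       # (P, N) correct-vs-incorrect

    # Exclude trivial self-similarity on the diagonal.
    num_pos = pos.size(0)
    self_mask = torch.eye(num_pos, dtype=torch.bool, device=pos.device)
    sim_pp = sim_pp.masked_fill(self_mask, float("-inf"))

    # Softmax over all candidates; average log-probability assigned
    # to the other correct rollouts (supervised-contrastive style).
    logits = torch.cat([sim_pp, sim_pn], dim=1)        # (P, P+N)
    log_prob = F.log_softmax(logits, dim=1)
    pos_log_prob = (log_prob[:, :num_pos]
                    .masked_fill(self_mask, 0.0)       # zero out -inf self terms
                    .sum(dim=1) / (num_pos - 1))
    return -pos_log_prob.mean()

# Toy usage: 4 correct and 3 incorrect rollout embeddings of dim 16.
loss = contrastive_rollout_loss(torch.randn(4, 16), torch.randn(3, 16))
```

In an actual RLVR training loop, a term like this would presumably be added to the policy-optimization objective with a weighting coefficient; the paper's repository linked above contains the authors' real formulation.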