CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR relies solely on final answers as outcome rewards and neglects the correctness of intermediate reasoning steps. Training on rollouts whose reasoning process is wrong but whose final answer is correct can lead to hallucination and answer-copying, severely undermining the model's generalization and robustness. To address this, we introduce CLIPO, which incorporates a contrastive learning mechanism into policy optimization to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
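As a rough illustration of what "a contrastive loss over successful rollouts" could look like, the sketch below computes a supervised-contrastive (InfoNCE-style) loss over one group of rollouts, treating every pair of successful rollouts as positives and all other rollouts as negatives. This is not taken from the CLIPO code: the function name `contrastive_loss_over_successes`, the use of one pooled embedding per rollout, cosine similarity, the choice of negatives, and the `temperature` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def contrastive_loss_over_successes(embeddings, rewards, temperature=0.1):
    """Supervised-contrastive loss with successful rollouts as mutual positives.

    Hypothetical sketch, not the CLIPO implementation.
    embeddings: (N, D) array, one vector per rollout (assumed pooled representation).
    rewards:    (N,) array of 0/1 verifiable outcome rewards.
    Returns 0.0 when fewer than two rollouts succeeded (no positive pair exists).
    """
    z = np.asarray(embeddings, dtype=np.float64)
    success = np.asarray(rewards) > 0
    if success.sum() < 2:
        return 0.0

    # L2-normalise so that dot products are cosine similarities.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    logits = z @ z.T / temperature

    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax over all *other* rollouts (self-similarity is masked out).
    logits = np.where(not_self, logits, -np.inf)
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positive pairs: both rollouts succeeded and they are distinct.
    pos = success[:, None] & success[None, :] & not_self

    # Mean negative log-probability of the positives, per successful anchor.
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[success] / pos.sum(axis=1)[success]
    return float(per_anchor.mean())
```

Under these assumptions the loss is small when successful rollouts sit close together in embedding space and large when a successful rollout is closer to a failed one. In an RLVR setup such a term would presumably be added to the policy-optimization objective with a weighting coefficient; the abstract does not state how the two are combined.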
