Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Abstract

Group-advantage-based reinforcement learning methods such as GRPO and DAPO have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, because they rely on sample-level rewards, they assign the same credit to every token and therefore fail to capture fine-grained, token-level contributions. To address this limitation, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, yielding more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions: visual areas aligned with the textual prompt in text-to-image generation, and critical keywords within reasoning traces in chain-of-thought tasks. In extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
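The contrast between uniform and token-level credit assignment can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract only states that token-level advantages are proportional to the difference between predictions under positive and negative prompts. The sketch assumes that the difference is measured as the absolute gap between per-token log-probabilities, that the weights are rescaled to average 1 within a sample, and that sample-level advantages are obtained by standardizing rewards within a group, as is common for GRPO-style methods. The function names `group_advantages` and `token_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Sample-level advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(sample_advantage, logp_pos, logp_neg, eps=1e-8):
    """Spread one sample-level advantage over the tokens of that sample.

    Assumption (not from the paper): the contrast for each token is
    |logp_pos - logp_neg|, rescaled so the weights average to 1. With a
    uniform contrast this reduces to broadcasting the sample advantage.
    """
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    total = sum(gaps)
    if total < eps:
        # No contrast signal at all: fall back to uniform credit.
        return [sample_advantage] * len(gaps)
    scale = len(gaps) / total
    return [sample_advantage * g * scale for g in gaps]


# Two samples in a group; the first one receives the higher reward.
sample_adv = group_advantages([1.0, 0.0])

# Hypothetical per-token log-probabilities under the positive and negative prompt.
logp_pos = [-1.0, -2.0, -0.5]
logp_neg = [-1.0, -4.0, -1.5]
per_token = token_advantages(sample_adv[0], logp_pos, logp_neg)
```

In this toy case the first token has identical predictions under both prompts and receives no credit, while the second token, where the two predictions differ most, receives the largest share. Because the weights average to 1, the total credit per sample matches what uniform broadcasting would assign.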