Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Abstract

Group-advantage-based reinforcement learning methods such as GRPO and DAPO have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, because they rely on sample-level rewards, they assign the same credit to every token and therefore fail to capture fine-grained, token-level contributions. To address this limitation, we propose Guidance Contrastive Policy Optimization (GCPO), an algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, which yields more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions: visual areas aligned with the textual prompt in text-to-image generation, and critical keywords within reasoning traces in chain-of-thought tasks. In extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
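
The contrast between uniform and per-token credit described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that token-level advantages are proportional to the difference between predictions under positive and negative prompts. The sketch therefore assumes that this difference is measured as the absolute gap in log-probability of each sampled token, and that the resulting weights are rescaled to average to 1 within a sample. The function names `group_advantages`, `contrastive_token_weights`, and `token_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Sample-level advantages in the GRPO style: standardize rewards within a group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def contrastive_token_weights(logp_pos, logp_neg, eps=1e-8):
    """Per-token weights from the positive/negative prompt gap.

    Assumption (not stated in the abstract): the gap is the absolute difference
    in log-probability of the sampled token, rescaled so the weights average to 1.
    """
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    total = sum(gaps)
    if total < eps:
        # No contrast anywhere: fall back to uniform credit.
        return [1.0] * len(gaps)
    scale = len(gaps) / total
    return [g * scale for g in gaps]


def token_advantages(sample_advantage, logp_pos, logp_neg):
    """Spread one sample-level advantage over tokens according to the contrast weights."""
    weights = contrastive_token_weights(logp_pos, logp_neg)
    return [sample_advantage * w for w in weights]


if __name__ == "__main__":
    # Toy group of four samples with binary rewards (illustrative values only).
    advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
    # Toy log-probabilities of three sampled tokens under each prompt.
    logp_pos = [-0.1, -2.0, -0.5]
    logp_neg = [-0.1, -0.5, -0.5]
    print(token_advantages(advantages[0], logp_pos, logp_neg))
```

In this toy case only the second token's probability changes between the two prompts, so it receives all of the sample's credit, whereas uniform broadcasting would give each of the three tokens the same advantage. Normalizing the weights to mean 1 is one possible design choice that keeps the total credit per sample comparable to the uniform case.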