Discrete Policy Optimization via Guidance Contrastive Token Credit Assignment

Abstract

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, they rely on sample-level rewards and broadcast the same credit to every token, so they fail to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, yielding more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions: visual areas aligned with the textual prompt in text-to-image generation, and critical keywords within reasoning traces in chain-of-thought tasks. In extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
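
To make the contrast between uniform and token-level credit concrete, the following is a minimal sketch, not the paper's implementation. The abstract states only that token-level advantages are proportional to the difference between predictions under positive and negative prompts; the function names, the use of per-token log-probabilities, the absolute-gap weighting, and the mean-one normalization are all assumptions made for illustration.

```python
import math


def group_advantages(rewards):
    """GRPO-style sample-level advantages: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def token_advantages(sample_adv, logp_pos, logp_neg):
    """Distribute one sample-level advantage over tokens by contrastive gap.

    Assumption (hypothetical weighting): each token's weight is the absolute
    difference between its log-probability under the positive prompt and under
    the negative prompt, rescaled so the weights average to 1. The total credit
    then matches what uniform broadcasting would assign.
    """
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    total = sum(gaps)
    if total == 0.0:
        # No contrast anywhere: fall back to uniform broadcasting.
        return [sample_adv] * len(gaps)
    scale = len(gaps) / total
    return [sample_adv * g * scale for g in gaps]


if __name__ == "__main__":
    # Two sampled responses in a group, with scalar rewards.
    advs = group_advantages([1.0, 0.0])
    # Hypothetical per-token log-probabilities for the first response.
    logp_pos = [-1.0, -0.5, -2.0]
    logp_neg = [-1.0, -1.5, -2.5]
    print(token_advantages(advs[0], logp_pos, logp_neg))
```

In this sketch, a token whose prediction is unaffected by the prompt contrast receives zero credit, while tokens with larger gaps receive proportionally more. Whether GCPO uses an absolute gap, a signed gap, or a different normalization is not specified in the abstract.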