GCPO: When Contrast Fails, Go Gold

Abstract

Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models, and extending the reasoning limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is determined entirely by the model itself, so the model cannot learn from problems whose sampled responses are all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem-solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.
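The GRPO drawback described in the abstract can be illustrated with a minimal sketch. It assumes binary correctness rewards and the commonly used group-standardized advantage (reward minus group mean, divided by group standard deviation). The function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions and are not taken from the GCPO code base; the sketch shows only the vanishing training signal, not the GCPO objective itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own rollout group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group: correct rollouts get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform groups: every reward equals the group mean, so every advantage is zero
# and the group contributes no policy-gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
print("all right:", all_right)
```

In the two uniform groups every advantage is zero, which is the case the abstract identifies as wasted under GRPO and the case where GCPO is said to bring in the reference answer.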
