GCPO: When Contrast Fails, Go Gold
October 9, 2025
Authors: Hao Wu, Wei Liu
cs.AI
Abstract
Reinforcement learning has been widely applied to enhance the reasoning
capabilities of large language models, and extending the reasoning limits of
smaller models has become a prominent research focus. However, algorithms such
as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the
upper bound of a model's rollout responses is entirely determined by the model
itself, which prevents the model from acquiring knowledge from samples that are
either all incorrect or all correct. In this paper, we introduce Group
Contrastive Policy Optimization (GCPO), a method that incorporates external
standard reference answers. When the model cannot solve a problem, the
reference answer supplies the correct response, steering the model toward an
unambiguously correct update direction. This approach offers two main
advantages: (1) it improves training efficiency by fully utilizing every
sample; (2) it enables the model to emulate the problem-solving strategy of the
reference answer during training, thereby improving the generalization of its
reasoning. GCPO achieves outstanding results across multiple benchmark
datasets, yielding substantial improvements over the baseline model. Our code
is available at: https://github.com/AchoWu/GCPO.
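
The abstract describes the mechanism only at a high level, so the following is a minimal Python sketch of one plausible reading: with GRPO-style group normalization, a rollout group that is all wrong (or all right) yields zero advantages and therefore no learning signal, whereas injecting an externally supplied reference answer restores a usable update direction. The function names (`grpo_advantages`, `gcpo_style_advantages`), the binary reward convention, and the inject-into-the-group mechanic are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: contrasts GRPO-style group-normalized advantages with a
# GCPO-style fallback that injects a gold reference answer when the group
# provides no contrast. Names and details are assumptions for illustration.
from statistics import mean, pstdev
from typing import List, Tuple

EPS = 1e-6  # guards against division by zero when all rewards are equal


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: (r - mean) / std over one prompt's rollouts.

    If every rollout receives the same reward (all wrong or all right), the
    advantages collapse to ~0 and the sample contributes no learning signal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def gcpo_style_advantages(
    rewards: List[float], reference_reward: float = 1.0
) -> Tuple[List[float], bool]:
    """GCPO-style idea (sketched): if the group has no contrast, append a
    reference answer (assumed reward 1.0) before normalizing, so the group
    mean shifts and the update points toward the reference behaviour.

    Returns the advantages for the augmented group (original rollouts plus
    the injected reference, if any) and a flag indicating injection.
    """
    if len(set(rewards)) > 1:  # group already has contrast; behave like GRPO
        return grpo_advantages(rewards), False

    augmented = rewards + [reference_reward]
    mu, sigma = mean(augmented), pstdev(augmented)
    advs = [(r - mu) / (sigma + EPS) for r in augmented]
    # The final advantage belongs to the injected reference completion, which
    # a full pipeline would optimize via its log-likelihood under the policy.
    return advs, True


if __name__ == "__main__":
    all_wrong = [0.0, 0.0, 0.0, 0.0]
    print(grpo_advantages(all_wrong))        # ~[0, 0, 0, 0]: no signal
    print(gcpo_style_advantages(all_wrong))  # negative advs, positive for gold
```

In a full training loop the injected reference completion would also need token log-probabilities under the current policy; this sketch only illustrates why adding an external gold answer recovers a nonzero, well-directed advantage when group contrast fails.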