GCPO: When Contrast Fails, Go Gold
October 9, 2025
Authors: Hao Wu, Wei Liu
cs.AI
Abstract
Reinforcement learning has been widely applied to enhance the reasoning
capabilities of large language models, and extending the reasoning limits of
smaller models has become a prominent research focus. However, algorithms such
as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the
upper bound of a model's rollout responses is entirely determined by the model
itself, which prevents the model from acquiring knowledge from samples that are
either all incorrect or all correct. In this paper, we introduce Group
Contrastive Policy Optimization (GCPO), a method that incorporates external
standard reference answers. When the model cannot solve a problem, the
reference answer supplies the correct response, steering the model toward an
unambiguously correct update direction. This approach offers two main
advantages: (1) it improves training efficiency by fully utilizing every
sample; (2) it enables the model to emulate the problem-solving strategy of the
reference answer during training, thereby improving the generalization of its
reasoning. GCPO achieves outstanding results across multiple benchmark
datasets, yielding substantial improvements over the baseline model. Our code
is available at: https://github.com/AchoWu/GCPO.
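
The abstract describes the mechanism only at a high level, so the following is a minimal Python sketch of one plausible reading: with GRPO-style group normalization, a rollout group that is all wrong (or all right) yields zero advantages and therefore no learning signal, whereas injecting an externally supplied reference answer restores a usable update direction. The function names (`grpo_advantages`, `gcpo_style_advantages`), the binary reward convention, and the inject-into-the-group mechanic are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: contrasts GRPO-style group-normalized advantages with a
# GCPO-style fallback that injects a gold reference answer when the group
# provides no contrast. Names and details are assumptions for illustration.
from statistics import mean, pstdev
from typing import List, Tuple

EPS = 1e-6  # guards against division by zero when all rewards are equal


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: (r - mean) / std over one prompt's rollouts.

    If every rollout receives the same reward (all wrong or all right), the
    advantages collapse to ~0 and the sample contributes no learning signal.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def gcpo_style_advantages(
    rewards: List[float], reference_reward: float = 1.0
) -> Tuple[List[float], bool]:
    """GCPO-style idea (sketched): if the group has no contrast, append a
    reference answer (assumed reward 1.0) before normalizing, so the group
    mean shifts and the update points toward the reference behaviour.

    Returns the advantages for the augmented group (original rollouts plus
    the injected reference, if any) and a flag indicating injection.
    """
    if len(set(rewards)) > 1:  # group already has contrast; behave like GRPO
        return grpo_advantages(rewards), False

    augmented = rewards + [reference_reward]
    mu, sigma = mean(augmented), pstdev(augmented)
    advs = [(r - mu) / (sigma + EPS) for r in augmented]
    # The final advantage belongs to the injected reference completion, which
    # a full pipeline would optimize via its log-likelihood under the policy.
    return advs, True


if __name__ == "__main__":
    all_wrong = [0.0, 0.0, 0.0, 0.0]
    print(grpo_advantages(all_wrong))        # ~[0, 0, 0, 0]: no signal
    print(gcpo_style_advantages(all_wrong))  # negative advs, positive for gold
```

In a full training loop the injected reference completion would also need token log-probabilities under the current policy; this sketch only illustrates why adding an external gold answer recovers a nonzero, well-directed advantage when group contrast fails.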