
GCPO: When Contrast Fails, Go Gold

October 9, 2025
Authors: Hao Wu, Wei Liu
cs.AI

Abstract

Reinforcement learning has been widely applied to enhance the reasoning capabilities of large language models. Extending the inference limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) suffer from a clear drawback: the upper bound of a model's rollout responses is entirely determined by the model itself, preventing the acquisition of knowledge from samples that are either all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unequivocally accurate update direction. This approach offers two main advantages: (1) it improves training efficiency by fully utilizing every sample; (2) it enables the model to emulate the problem-solving strategy of the reference answer during training, thereby enhancing generalization in reasoning. GCPO achieves outstanding results across multiple benchmark datasets, yielding substantial improvements over the baseline model. Our code is available at: https://github.com/AchoWu/GCPO.
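
The sketch below illustrates the group-construction step the abstract describes: when a GRPO-style rollout group is degenerate (the model solves nothing, so contrast carries no signal), an external reference answer is injected so the group still yields a usable advantage. It is a minimal sketch assuming a standard normalized group-relative advantage; the names (build_group, reference_answer, reference_reward) are illustrative and not taken from the released GCPO code.

```python
# Minimal sketch of the group-construction step described above, assuming a
# GRPO-style normalized advantage. Names (build_group, reference_answer,
# reference_reward) are illustrative, not taken from the released GCPO code.
from statistics import mean, pstdev

def build_group(rollouts, rewards, reference_answer, reference_reward=1.0):
    """Return (responses, advantages) for one prompt's rollout group.

    Under GRPO, if every rollout earns the same reward (e.g. all wrong),
    the group-relative advantage is zero and the sample yields no update.
    The remedy described in the abstract is to supply an external standard
    reference answer so the group still carries a learning signal.
    """
    responses = list(rollouts)
    rs = [float(r) for r in rewards]
    if max(rs) <= 0.0:                       # model solved nothing: contrast alone fails
        responses.append(reference_answer)   # "go gold": inject the reference answer
        rs.append(reference_reward)
    mu, sigma = mean(rs), pstdev(rs)
    advantages = [(r - mu) / (sigma + 1e-6) for r in rs]
    return responses, advantages

# Example: three failed rollouts plus the injected gold reference.
responses, adv = build_group(
    ["wrong A", "wrong B", "wrong C"], [0.0, 0.0, 0.0], "gold solution"
)
```

With all-zero rewards the plain group-relative advantage would vanish; after injecting the reference, the failed rollouts receive negative advantages and the gold answer a positive one, giving the update the unambiguous direction the abstract refers to.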