GCPO: When Contrast Fails, Go Gold

Abstract

Reinforcement learning has been widely applied to improve the reasoning capabilities of large language models, and extending the reasoning limits of smaller models has become a prominent research focus. However, algorithms such as Group Relative Policy Optimization (GRPO) have a clear drawback: the upper bound on a model's rollout responses is determined entirely by the model itself, so the model learns nothing from groups whose samples are all incorrect or all correct. In this paper, we introduce Group Contrastive Policy Optimization (GCPO), a method that incorporates external standard reference answers. When the model cannot solve a problem, the reference answer supplies the correct response, steering the model toward an unambiguously correct update direction. This approach offers two main advantages: (1) it improves training efficiency by making full use of every sample, and (2) it lets the model emulate the reference answer's problem-solving strategy during training, which improves generalization in reasoning. GCPO achieves strong results across multiple benchmark datasets, with substantial improvements over the baseline model. Our code is available at https://github.com/AchoWu/GCPO.
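
To make the drawback concrete, the following minimal sketch shows why a group of uniformly scored rollouts gives no learning signal under GRPO-style normalization, and how adding a correct reference answer restores one. It is not the authors' implementation. It assumes binary rewards, the usual within-group mean and standard deviation normalization, and a hypothetical `inject_reference` helper that swaps the gold answer in for one rollout when every rollout is wrong; the abstract does not specify any of these details, and the sketch covers only the all-incorrect case.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inject_reference(rewards, gold_reward=1.0):
    """Hypothetical helper (an assumption, not taken from the paper):
    if every rollout is wrong, replace the last slot with the reward
    of the gold reference answer so the group contains a contrast."""
    if all(r == 0.0 for r in rewards):
        return list(rewards[:-1]) + [gold_reward]
    return list(rewards)


all_wrong = [0.0, 0.0, 0.0, 0.0]

# Every advantage is zero, so the policy gradient for this group vanishes.
plain = group_relative_advantages(all_wrong)

# With the reference answer in the group, the gold slot gets a positive
# advantage and the failed rollouts get negative ones.
with_gold = group_relative_advantages(inject_reference(all_wrong))
```

A group that already mixes correct and incorrect rollouts passes through `inject_reference` unchanged, so in this sketch the reference answer is used only when the group would otherwise carry no signal.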
