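It Takes Two: Your GRPO Is Secretly DPO

Abstract

Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO needs a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.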
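
A minimal sketch of why the two-rollout case can look like a pairwise preference comparison. It assumes the commonly used GRPO advantage, in which each reward is standardized by its group's mean and standard deviation, together with binary rewards. The function name, the `eps` constant, and the choice of the population standard deviation are illustrative assumptions and are not taken from the paper.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts for the same prompt, binary rewards (1 = correct, 0 = incorrect).
pair_mixed = group_relative_advantages([1.0, 0.0])  # one better, one worse
pair_tied = group_relative_advantages([1.0, 1.0])   # identical rewards

print(pair_mixed)
print(pair_tied)
```

Under these assumptions, a pair with different rewards yields advantages of approximately +1 and -1, so one rollout is reinforced and the other is suppressed, much like a chosen/rejected pair in DPO. A tied pair yields zero advantages and contributes no learning signal.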

