It Takes Two: Your GRPO Is Secretly DPO
October 1, 2025
Authors: Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
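
To make the contrastive reading concrete, here is a minimal worked sketch of the two-rollout case, using the group-relative advantage normalization commonly associated with GRPO (the notation is illustrative and not taken verbatim from the paper):

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad i = 1, \dots, G.
\]

For G = 2 with distinct rewards r_1 > r_2, the mean is (r_1 + r_2)/2 and the population standard deviation is (r_1 - r_2)/2, so

\[
\hat{A}_1 = +1, \qquad \hat{A}_2 = -1.
\]

Under this assumed normalization, every two-rollout group collapses to a winner/loser pair with equal and opposite weights, independent of the size of the reward gap, which is the same pairwise contrastive structure that a DPO preference pair induces; this is one way to see why a 2-GRPO configuration can behave like a DPO-style objective.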