It Takes Two: Your GRPO Is Secretly DPO
October 1, 2025
Authors: Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
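A minimal numerical sketch of the intuition behind the GRPO-DPO connection described above, assuming the standard GRPO group normalization (subtract the group mean, divide by the group standard deviation; the epsilon and the population-vs-sample std convention vary by implementation and are assumptions here). This is not the paper's derivation, only an illustration of why a two-rollout group yields a pairwise preference-style signal:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-normalized advantages in the GRPO style:
    (reward - group mean) / (group std + epsilon)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# With a larger group, advantage magnitudes depend on the whole group's
# reward distribution.
print(grpo_advantages([0.1, 0.4, 0.7, 0.9]))
# -> roughly [-1.40, -0.41,  0.58,  1.24]

# With only two rollouts (2-GRPO), the normalized advantages collapse to
# +1 for the higher-reward rollout and -1 for the other, regardless of the
# reward magnitudes: a binary "chosen vs. rejected" signal, the same kind
# of pairwise preference label that DPO trains on.
print(grpo_advantages([0.9, 0.2]))   # -> [ 1., -1.]
print(grpo_advantages([0.2, 0.9]))   # -> [-1.,  1.]
```

Under this normalization, a two-rollout group only ever tells the policy which of the two responses was better, which is why the contrastive, DPO-like reading of GRPO suggested in the abstract is most transparent in the 2-GRPO setting.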