
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

July 29, 2025
Authors: Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong
cs.AI

Abstract

Although GRPO substantially enhances flow matching models in human preference alignment for image generation, methods such as FlowGRPO still exhibit inefficiency due to the need to sample and optimize over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving efficiency and boosting performance. Specifically, MixGRPO introduces a sliding window mechanism, applying SDE sampling and GRPO-guided optimization only within the window and ODE sampling outside it. This design confines sampling randomness to the time-steps within the window, reducing optimization overhead and allowing more focused gradient updates that accelerate convergence. Additionally, because time-steps beyond the sliding window are not involved in optimization, higher-order solvers can be used for sampling. We therefore present a faster variant, termed MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Code and models are available at https://github.com/Tencent-Hunyuan/MixGRPO.
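To make the sampling scheme described in the abstract concrete, here is a minimal PyTorch-style sketch of mixed ODE-SDE sampling with a sliding window. It is not the authors' implementation: the velocity-model interface, the constant noise scale `sigma`, and the plain Euler / Euler-Maruyama updates are all simplifying assumptions for illustration.

```python
import math
import torch

def mixgrpo_sample(model, x, timesteps, win_start, win_size, sigma=0.7):
    """Mixed ODE-SDE sampling with a sliding window (illustrative sketch).

    Steps whose index falls in [win_start, win_start + win_size) take a
    stochastic Euler-Maruyama (SDE) update, and their log-probabilities are
    recorded for the GRPO policy ratio; all other steps take a deterministic
    Euler (ODE) update and require no optimization.
    """
    logps = []  # per-step log-probs for GRPO, collected inside the window
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        dt = t_next - t
        v = model(x, t)  # assumption: model returns the flow-matching velocity
        if win_start <= i < win_start + win_size:
            # SDE step: inject Gaussian noise and record the log-probability
            mean = x + v * dt
            std = sigma * math.sqrt(abs(dt))  # hypothetical noise schedule
            x = mean + std * torch.randn_like(x)
            logps.append(torch.distributions.Normal(mean, std).log_prob(x).sum())
        else:
            # ODE step: deterministic Euler update; outside the window a
            # higher-order solver could be substituted, as in MixGRPO-Flash
            x = x + v * dt
    return x, logps
```

Under this reading, only `win_size` of the denoising steps incur GRPO optimization cost per iteration, and sliding the window across the schedule during training would let every time-step eventually receive GRPO-guided updates.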