MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Abstract

Although GRPO substantially improves flow matching models at human preference alignment in image generation, methods such as FlowGRPO remain inefficient because they must sample and optimize over every denoising step specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that exploits the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving both efficiency and performance. Specifically, MixGRPO introduces a sliding window mechanism: SDE sampling and GRPO-guided optimization are used only within the window, while ODE sampling is applied outside it. This design confines sampling randomness to the timesteps within the window, which reduces the optimization overhead and allows more focused gradient updates that accelerate convergence. Additionally, because timesteps outside the sliding window are not involved in optimization, higher-order solvers can be used for sampling there. We therefore present a faster variant, MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO shows substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Code and models are available at https://github.com/Tencent-Hunyuan/MixGRPO.
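The sliding window described above can be pictured as a per-step labelling of the denoising trajectory. The sketch below is illustrative only and is not taken from the MixGRPO code: the function names (step_modes, window_start_at) and the parameters window_size, shift_every, and stride are hypothetical, and the abstract does not say how or when the window moves. It marks steps inside the window as "sde" (stochastic sampling, included in GRPO optimization) and all other steps as "ode" (deterministic sampling, no optimization).

```python
from typing import List


def step_modes(num_steps: int, window_start: int, window_size: int) -> List[str]:
    """Label each denoising step "sde" (inside the window) or "ode" (outside).

    Hypothetical helper: only "sde" steps would carry sampling randomness
    and take part in GRPO optimization.
    """
    if num_steps <= 0 or window_size <= 0:
        raise ValueError("num_steps and window_size must be positive")
    # Clamp the start so the window stays inside [0, num_steps).
    start = max(0, min(window_start, num_steps - window_size))
    end = min(start + window_size, num_steps)
    return ["sde" if start <= t < end else "ode" for t in range(num_steps)]


def window_start_at(iteration: int, shift_every: int, stride: int) -> int:
    """Assumed schedule: advance the window by `stride` steps
    once every `shift_every` training iterations."""
    return (iteration // shift_every) * stride


if __name__ == "__main__":
    # Print the step labelling at a few training iterations.
    for it in (0, 10, 20):
        start = window_start_at(it, shift_every=10, stride=2)
        print(it, step_modes(num_steps=8, window_start=start, window_size=3))
```

Under these assumptions, only window_size of the num_steps denoising steps are optimized at any given training iteration, which is the source of the reduced overhead the abstract describes. The remaining "ode" steps are the ones where a higher-order solver could be substituted.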
