UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Abstract

The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning (RL) remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
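
The two insights above can be illustrated with a toy sketch. It is not the authors' implementation. It assumes the standard GRPO convention of standardizing rewards within a group of samples for the same prompt, and a generic uniform-noise forward kernel that keeps each token with some probability and otherwise resamples it uniformly from the vocabulary. The function names, the `keep_prob` schedule, the vocabulary size, and the reward values are all hypothetical.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group (samples for the same prompt).

    Assumption: the usual GRPO normalization (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def uniform_forward_noise(clean_tokens, keep_prob, vocab_size, rng):
    """Corrupt a clean token sequence with uniform noise.

    Each token is kept with probability keep_prob; otherwise it is
    replaced by a token drawn uniformly from the vocabulary.
    Assumption: a generic uniform forward kernel, not the paper's exact one.
    """
    return [
        tok if rng.random() < keep_prob else rng.randrange(vocab_size)
        for tok in clean_tokens
    ]


rng = random.Random(0)

# Hypothetical final clean sample (token ids) and a group of scalar rewards.
clean = [3, 1, 4, 1, 5, 9, 2, 6]
rewards = [0.2, 0.9, 0.5, 0.4]

# One advantage per sample in the group.
advantages = group_relative_advantages(rewards)

# Rebuild intermediate noisy states from the clean sample with the forward
# process, instead of reusing the states visited during sampling.
noisy_path = [uniform_forward_noise(clean, k, 16, rng) for k in (0.75, 0.5, 0.25)]
```

In this sketch the clean sample plays the role of the action that receives the advantage, and the noisy states are reconstructed from it afterwards; how the actual method weights timesteps or computes likelihoods is not specified here.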
