UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Abstract

The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning (RL) remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and only marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
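The two insights can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the standard GRPO practice of standardizing rewards within a group of samples drawn for the same prompt, and a simple uniform corruption in which each token of the clean sample is kept with some probability and otherwise resampled uniformly from the vocabulary. The function names, the keep-probability schedule, and the toy rewards are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (GRPO-style advantage)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def uniform_forward_noise(x0, keep_prob, vocab_size, rng):
    """Assumed uniform forward corruption of clean tokens x0.

    Each token is kept with probability keep_prob; otherwise it is
    replaced by a token drawn uniformly from the vocabulary.
    """
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < keep_prob
    noise = rng.integers(0, vocab_size, size=x0.shape)
    return np.where(keep, x0, noise)

rng = np.random.default_rng(0)
vocab_size = 16

# Toy group: 4 final clean samples (8 tokens each) with hypothetical rewards.
x0 = rng.integers(0, vocab_size, size=(4, 8))
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

# Rebuild a noisy trajectory from the clean samples with the forward
# process, instead of reusing the states visited during sampling.
keep_probs = [0.75, 0.5, 0.25]  # hypothetical schedule, least to most noisy
trajectory = [uniform_forward_noise(x0, p, vocab_size, rng) for p in keep_probs]
```

In this sketch the advantage is attached to the clean sample `x0`, and every noisy state in `trajectory` is derived from it, which mirrors the abstract's description of treating the final clean sample as the action.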

