UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
April 20, 2026
Authors: Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
cs.AI
Abstract
The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple text-to-image (T2I) tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
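To make the two insights above concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of (i) computing GRPO-style group-relative advantages on the final clean samples and (ii) re-noising those samples with a uniform discrete forward process to rebuild training states. All names and shapes (e.g. `vocab_size`, `num_tokens`, the stand-in rewards) are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: normalize rewards within a group of samples."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def uniform_forward_noise(x0: torch.Tensor, t: float, vocab_size: int) -> torch.Tensor:
    """Uniform discrete forward process: each token is independently replaced
    by a uniformly random token with probability t (t in [0, 1])."""
    replace = torch.rand_like(x0, dtype=torch.float) < t
    random_tokens = torch.randint_like(x0, vocab_size)
    return torch.where(replace, random_tokens, x0)

# Usage sketch: score a group of final clean samples, form group-relative
# advantages, then rebuild a noised state at level t from each clean sample
# for the subsequent policy-gradient update.
group_size, num_tokens, vocab_size = 8, 256, 16384
x0_group = torch.randint(vocab_size, (group_size, num_tokens))  # final clean samples (the "actions")
rewards = torch.rand(group_size)                                # stand-in for a reward model score
advantages = group_relative_advantages(rewards)
xt_group = uniform_forward_noise(x0_group, t=0.5, vocab_size=vocab_size)
```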