TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
March 8, 2026
Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang
cs.AI
Abstract
While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, thereby excluding most important real-world reward signals, such as binary human preferences and object counts, which are non-differentiable. To properly incorporate non-differentiable rewards into the improvement of few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves the ability of few-step models to learn from generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
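The decoupling described in the abstract can be illustrated in a toy setting. The sketch below is not the paper's implementation: the "generator", the black-box 0/1 reward, and all dimensions are illustrative assumptions. It shows the core mechanism only, in which a non-differentiable reward is first distilled into a differentiable surrogate (stage 1), and the generator is then optimized against the surrogate with ordinary gradients (stage 2).

```python
# Toy sketch of surrogate reward learning vs. generator learning.
# Assumption: the non-differentiable reward is a black-box 0/1 signal
# (here, "is the sample's mean positive?"), standing in for rewards
# like binary human preference or object counts.
import numpy as np

rng = np.random.default_rng(0)

def black_box_reward(x):
    # Non-differentiable: returns 1.0 if a sample's mean is positive.
    return (x.mean(axis=1) > 0).astype(float)

# Stage 1: surrogate reward learning.
# Fit a differentiable surrogate (logistic regression) to the binary reward.
w, b, lr = np.zeros(8), 0.0, 0.1
for _ in range(500):
    x = rng.standard_normal((64, 8))
    y = black_box_reward(x)
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # surrogate prediction
    grad = p - y                              # d(BCE)/d(logit)
    w -= lr * x.T @ grad / 64
    b -= lr * grad.mean()

# Stage 2: generator learning.
# A toy linear "generator" maps noise z to samples; it ascends the
# surrogate reward by back-propagation, never touching the black box.
G = rng.standard_normal((4, 8)) * 0.1
c = np.zeros(8)
for _ in range(300):
    z = rng.standard_normal((64, 4))
    x = z @ G + c
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    dx = (p * (1 - p))[:, None] * w           # d(surrogate)/d(sample)
    G += lr * z.T @ dx / 64                   # gradient ascent
    c += lr * dx.mean(axis=0)

# Evaluate with the true (non-differentiable) reward.
final_reward = black_box_reward(rng.standard_normal((256, 4)) @ G + c).mean()
print(final_reward)
```

Because the generator only ever sees gradients of the learned surrogate, the same loop works unchanged whatever the black-box reward is; this is the property that lets the paper's method handle rewards that cannot be back-propagated through directly.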