

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

March 8, 2026
Authors: Yihong Luo, Tianyang Hu, Weijian Luo, Jing Tang
cs.AI

Abstract

While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, which excludes the majority of important real-world reward signals, e.g., non-differentiable rewards such as human binary preferences, object counts, etc. To properly incorporate non-differentiable rewards into the optimization of few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along TDM's deterministic generation trajectory, resulting in a unified RL post-training method that significantly improves the ability of few-step models to learn from generic rewards. We conduct extensive experiments covering text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
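The decoupling into surrogate reward learning and generator learning can be illustrated with a minimal toy sketch. This is not the paper's implementation: the one-parameter "generator", the binary threshold reward, and the logistic surrogate below are all hypothetical simplifications chosen only to show the core idea — fit a differentiable surrogate to a non-differentiable reward, then back-propagate through the frozen surrogate to update the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x):
    # Non-differentiable reward, e.g. a binary "liked / not liked" label.
    return (x > 0.0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# --- Stage 1: surrogate reward learning --------------------------------
# Fit a differentiable logistic model s(x) = sigmoid(w*x + b) to the
# non-differentiable binary reward, using samples from the generator.
mu = -1.0                # toy generator parameter (sample mean)
w, b = 0.0, 0.0          # surrogate parameters
for _ in range(500):
    x = mu + rng.normal(size=256)
    y = true_reward(x)
    s = sigmoid(w * x + b)
    w -= 0.5 * np.mean((s - y) * x)   # logistic-loss gradient steps
    b -= 0.5 * np.mean(s - y)

before = true_reward(mu + rng.normal(size=10_000)).mean()

# --- Stage 2: generator learning ---------------------------------------
# Ascend the frozen surrogate's gradient w.r.t. the generator parameter,
# even though the true reward itself provides no gradient.
for _ in range(200):
    x = mu + rng.normal(size=256)
    s = sigmoid(w * x + b)
    mu += 0.1 * np.mean(s * (1.0 - s) * w)   # d s / d mu = d s / d x

after = true_reward(mu + rng.normal(size=10_000)).mean()
print(f"mean true reward: {before:.2f} -> {after:.2f}")
```

In this sketch the average true reward rises even though its gradient is zero almost everywhere; TDM-R1 additionally handles multi-step trajectories and per-step reward signals, which this one-dimensional example does not attempt to model.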