Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

May 25, 2026
Authors: Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang
cs.AI

Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
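
The decomposition claimed in the abstract can be illustrated with the standard identity for an exponentially tilted target. The sketch below is not taken from the paper: the symbols p_theta (generator distribution), p_T (teacher distribution), r (reward), beta (temperature), and Z (normalizer) are assumed notation, prompt conditioning is omitted, and the paper's own formulation (for example, over noised distributions or subintervals) may differ.

```latex
% Assumed notation: p_\theta = generator, p_T = teacher, r = reward,
% \beta > 0 = temperature, Z = normalizing constant (independent of \theta).
\begin{align}
p^{*}(x) &= \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_T}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\
D_{\mathrm{KL}}\!\big(p_\theta \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_\theta}\!\Big[\log p_\theta(x) - \log p_T(x) - \tfrac{1}{\beta}\, r(x) + \log Z\Big] \\
&= \underbrace{D_{\mathrm{KL}}\!\big(p_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\, \underbrace{\mathbb{E}_{x \sim p_\theta}\!\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{align}
```

Because log Z does not depend on the generator parameters, minimizing the left-hand side under these assumptions amounts to minimizing the KL divergence to the teacher while maximizing the expected reward, with beta setting the trade-off between the two terms.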