Reinforcing Few-Step Generators via Reward-Tilted Distribution Matching

Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
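
The decomposition claimed above can be sketched in a few lines. The abstract does not spell out the tilted target, so this sketch assumes the standard exponential tilting, with a teacher density $p_{\mathrm{T}}$, a reward $r$, a temperature $\beta > 0$, and a generator density $p_\theta$. All of these symbols are illustrative assumptions and may differ from the paper's notation.

```latex
\begin{align}
p^{*}(x) &= \frac{1}{Z}\, p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \int p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big)\, dx, \\
D_{\mathrm{KL}}\big(p_\theta \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_\theta}\!\big[\log p_\theta(x) - \log p_{\mathrm{T}}(x) - r(x)/\beta\big] + \log Z \\
&= \underbrace{D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\mathrm{T}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\, \underbrace{\mathbb{E}_{x \sim p_\theta}\!\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{align}
```

Under this assumption, $\log Z$ does not depend on $\theta$, so minimizing the left-hand side amounts to matching the teacher while maximizing the expected reward, with $\beta$ setting the trade-off between the two terms.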