Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

November 25, 2025
Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
cs.AI

Abstract

Diffusion models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique for accelerating generation, but it often requires extensive training and degrades image quality. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, with Reinforcement Learning (RL) is notoriously unstable and prone to reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence through distillation together with joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost while enhancing realism, outperforming DMD2 at only 2.1% of its training cost. Second, we introduce a joint training scheme in which the model is fine-tuned with an RL objective while timestep distillation continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing RL training and preventing policy collapse. Extensive experiments on score-based and flow-matching models show that Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods on visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code will be released soon.
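The key mechanism described in the abstract is the joint training scheme: the RL objective and the still-running distillation loss are optimized together, with the distillation term acting as a regularizer that anchors the policy. Below is a minimal sketch of what such a combined objective might look like; the function name `joint_loss`, the weight `lam`, and the placeholder tensors are illustrative assumptions, not details taken from the paper.

```python
import torch

def joint_loss(distill_loss: torch.Tensor,
               rl_loss: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    """Combine the ongoing timestep-distillation loss with an RL
    (reward-based) objective. The distillation term keeps providing a
    stable, well-defined training signal, which is what the paper
    credits with preventing reward hacking and policy collapse.
    `lam` is a hypothetical weighting, not a value from the paper."""
    return distill_loss + lam * rl_loss

# Toy usage with scalar tensors standing in for the real losses.
d = torch.tensor(0.8, requires_grad=True)  # stand-in distillation loss
r = torch.tensor(2.5, requires_grad=True)  # stand-in RL objective (e.g., negative reward)
loss = joint_loss(d, r)
loss.backward()  # gradients flow through both terms simultaneously
```

The design point this illustrates is that, unlike two-stage pipelines (distill first, then RL fine-tune), both terms contribute to every update, so the reward signal can never pull the model arbitrarily far from the distillation target.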