Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
November 25, 2025
Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
cs.AI
Abstract
Diffusion models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique for accelerating generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, with Reinforcement Learning (RL) is notoriously unstable and prone to reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence through distillation with joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost while enhancing realism, outperforming DMD2 with only 2.1% of its training cost. Second, we introduce a joint training scheme in which the model is fine-tuned with an RL objective while timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing RL training and preventing policy collapse. Extensive experiments on score-based and flow-matching models show that Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods on visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Code will be released soon.
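The joint training scheme described in the abstract, where the ongoing distillation loss regularizes an RL objective, can be sketched roughly as follows. This is a minimal illustrative sketch only: the function names, the REINFORCE-style surrogate for the RL term, and the trade-off weight `lam` are all assumptions, not details from the paper.

```python
def joint_objective(distill_loss: float, rl_loss: float, lam: float = 0.1) -> float:
    """Combine the stable distillation loss (acting as a regularizer)
    with the RL loss. The weighting scheme is an illustrative assumption."""
    return distill_loss + lam * rl_loss


def training_step(gen_step, distill_loss_fn, reward_fn, lam: float = 0.1) -> float:
    """One hypothetical update of the joint scheme.

    gen_step: callable producing a (sample, log_prob) pair from the student
    distill_loss_fn: callable scoring the sample against the teacher/distillation target
    reward_fn: callable returning a scalar reward (e.g. an aesthetic score)
    """
    sample, log_prob = gen_step()
    # REINFORCE-style surrogate for the RL objective (assumed form):
    # maximizing reward corresponds to minimizing -reward * log_prob.
    rl_loss = -reward_fn(sample) * log_prob
    # The distillation loss keeps being optimized alongside the RL term,
    # which is what stabilizes training in the paper's description.
    return joint_objective(distill_loss_fn(sample), rl_loss, lam)
```

In a real implementation the scalars above would be differentiable tensors and the step would end with a backward pass; the point of the sketch is only the structure of the combined loss.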