Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
May 25, 2026
Authors: Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang
cs.AI
Abstract
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
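The decomposition claimed in the abstract can be checked numerically on a toy example. The sketch below assumes the common exponential form of reward tilting, p*(x) ∝ p_teacher(x) · exp(r(x) / β); the abstract does not state the exact form or any temperature, so β, the discrete six-outcome setup, and all variable names here are illustrative assumptions rather than the authors' formulation. Under that assumption, KL(q ‖ p*) = KL(q ‖ p_teacher) − E_q[r] / β + log Z, where Z is the normalizer of the tilted distribution and does not depend on the generator q.

```python
import numpy as np


def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


rng = np.random.default_rng(0)
n = 6        # number of discrete outcomes (toy stand-in for images)
beta = 0.5   # hypothetical tilting temperature, not taken from the paper

p_teacher = rng.dirichlet(np.ones(n))  # stand-in for the teacher distribution
q_gen = rng.dirichlet(np.ones(n))      # stand-in for the few-step generator
reward = rng.normal(size=n)            # arbitrary reward per outcome

# Reward-tilted teacher: reweight the teacher by exp(r / beta), then normalize.
unnorm = p_teacher * np.exp(reward / beta)
Z = float(unnorm.sum())
p_tilted = unnorm / Z

# Left-hand side: KL from the generator to the tilted teacher.
lhs = kl(q_gen, p_tilted)

# Right-hand side: distribution matching term minus the scaled expected
# reward, plus a constant that does not depend on the generator.
dm_term = kl(q_gen, p_teacher)
reward_term = float(np.sum(q_gen * reward)) / beta
rhs = dm_term - reward_term + np.log(Z)

print(f"KL(q || p_tilted)            = {lhs:.6f}")
print(f"KL(q || p) - E_q[r]/b + logZ = {rhs:.6f}")
```

Because log Z is constant with respect to the generator, minimizing the left-hand side amounts to minimizing the distribution matching term while maximizing the expected reward, which is the split the abstract describes.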