Reinforcing Few-Step Generators via Reward-Tilted Distribution Matching

Abstract

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer, helping the fake score model track the shifting generator distribution under a limited number of updates. In the second stage, we jointly optimize both terms. For the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and we further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 show that RTDMD, using only 4 inference steps, achieves new state-of-the-art results across preference, aesthetic, and compositional metrics, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
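
The decomposition mentioned in the abstract can be illustrated with a short derivation. This is only a sketch under assumptions not stated in the abstract: the tilted target is taken to be the common exponential form with a temperature \beta > 0, and the symbols p_T (teacher distribution), p_\theta (generator distribution), r (reward), and Z (normalizing constant) are notation chosen here for illustration. Text conditioning is omitted for brevity. The paper's own definition may differ.

```latex
% Assumed reward-tilted target (exponential tilting with temperature \beta > 0):
\[
  p^{*}(x) \;=\; \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
  \qquad
  Z \;=\; \int p_T(x)\, \exp\!\big(r(x)/\beta\big)\, dx .
\]
% Expanding the KL divergence from the generator to the tilted target:
\begin{align*}
  \mathrm{KL}\big(p_\theta \,\|\, p^{*}\big)
  &= \mathbb{E}_{x \sim p_\theta}\!\left[ \log p_\theta(x) - \log p_T(x)
       - \tfrac{1}{\beta}\, r(x) + \log Z \right] \\
  &= \underbrace{\mathrm{KL}\big(p_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
     \;-\; \underbrace{\tfrac{1}{\beta}\, \mathbb{E}_{x \sim p_\theta}\big[ r(x) \big]}_{\text{reward maximization}}
     \;+\; \log Z .
\end{align*}
```

Because \log Z does not depend on \theta, minimizing the left-hand side under this assumed form amounts to minimizing the distribution matching term while maximizing the expected reward, with \beta setting the trade-off between the two.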