D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
May 6, 2026
Authors: Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi
cs.AI
Abstract
The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models pose significant challenges for direct continual supervised fine-tuning: applying the commonly used fine-tuning techniques compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models in which an LLM/VLM serves as the encoder can inherit the encoder's in-context capabilities. This allows us to cast training as an on-policy self-distillation process. Specifically, during training the model acts as both teacher and student under different contexts: the student is conditioned only on the text features, while the teacher is conditioned on the multimodal features of both the text prompt and the target image. Training minimizes the divergence between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectories under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing its original few-step capacity.
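The teacher/student mechanism described above can be sketched in a few lines. The toy denoiser, the feature dimensions, and the loss choice below are all illustrative assumptions (the abstract does not specify the authors' architecture or divergence); the sketch only shows the core idea: one set of weights, two forward passes with different conditioning contexts on the student's own roll-out, and a stop-gradient on the richer-context (teacher) pass.

```python
# Hypothetical sketch of an on-policy self-distillation step in the spirit of
# D-OPSD. Names, shapes, and the MSE objective are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Minimal conditional denoiser: predicts noise from (x, t, context)."""
    def __init__(self, dim=8, ctx_dim=8):
        super().__init__()
        self.net = nn.Linear(dim + 1 + ctx_dim, dim)

    def forward(self, x, t, ctx):
        inp = torch.cat([x, t[:, None], ctx], dim=-1)
        return self.net(inp)

def d_opsd_loss(model, x_rollout, t, text_ctx, mm_ctx):
    # Teacher pass: same weights, richer multimodal context
    # (text prompt + target image features); no gradient flows here.
    with torch.no_grad():
        teacher_pred = model(x_rollout, t, mm_ctx)
    # Student pass: text-only context, evaluated on the model's own roll-out.
    student_pred = model(x_rollout, t, text_ctx)
    # Self-distillation: pull the student's prediction toward the teacher's.
    return F.mse_loss(student_pred, teacher_pred)

model = ToyDenoiser()
x = torch.randn(4, 8)         # stands in for a student few-step roll-out
t = torch.rand(4)             # diffusion timesteps
text_ctx = torch.randn(4, 8)  # text-only features (student context)
mm_ctx = torch.randn(4, 8)    # text + target-image features (teacher context)
loss = d_opsd_loss(model, x, t, text_ctx, mm_ctx)
loss.backward()
```

Because teacher and student share parameters and the loss is computed on the student's own roll-outs, updates stay on-policy: the model is never pushed toward off-distribution targets, which is what lets it absorb new concepts without destroying its few-step sampling behavior.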