D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

May 6, 2026
作者: Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi
cs.AI

Abstract

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continual supervised fine-tuning. For example, applying commonly used fine-tuning techniques compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models in which an LLM/VLM serves as the encoder can inherit the encoder's in-context capabilities. This allows us to cast training as an on-policy self-distillation process. Specifically, during training, the model acts as both teacher and student under different contexts: the student is conditioned only on the text features, while the teacher is conditioned on the multimodal features of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectories under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing its original few-step capability.