D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Abstract

The landscape of high-performance image generation models is shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models pose significant challenges for direct continuous supervised fine-tuning: applying commonly used fine-tuning techniques compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that a modern diffusion model that uses an LLM/VLM as its encoder can inherit that encoder's in-context capabilities. This lets us frame training as an on-policy self-distillation process. Specifically, during training the same model acts as both teacher and student under different contexts: the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own rollouts. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, and so on without sacrificing its original few-step capability.
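One way to write down the objective described above, as a sketch rather than the paper's exact formulation: let $c$ be the text prompt, $x^\star$ the target image, $x_t$ an intermediate state on a few-step rollout sampled by the student itself, and $D$ some divergence or distance between the two predictions. All symbols, the choice of $D$, and the stop-gradient on the teacher branch (written $\bar\theta$) are assumptions made for illustration; the abstract specifies none of them.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{(c,\,x^\star),\; x_t \sim \pi_\theta(\cdot \mid c)}
    \Big[
      D\big(
        p_{\bar\theta}(\cdot \mid x_t, c, x^\star)
        \,\big\|\,
        p_\theta(\cdot \mid x_t, c)
      \big)
    \Big]
```

Both branches share the same weights; only the conditioning differs. Because $x_t$ is drawn from the student's own sampler $\pi_\theta$, the loss is evaluated on states the model actually visits at inference time, which is what makes the scheme on-policy.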
