Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Abstract

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and on high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) with up to 14B parameters and on 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1 to 4 steps, accelerating diffusion sampling by 15× to 50×. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
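The abstract names the Jacobian-vector product (JVP) as the main infrastructure bottleneck but does not spell out what is being computed. As a rough illustration only, the sketch below uses JAX's `jax.jvp` to obtain the total time derivative of a network output along a trajectory, which is the kind of quantity a continuous-time consistency objective differentiates. The names `toy_model` and `total_time_derivative`, the toy function itself, and the input values are assumptions made for this example; this is not the paper's FlashAttention-2 kernel or its training code.

```python
import jax
import jax.numpy as jnp


def toy_model(x, t):
    # Hypothetical stand-in for a consistency network f(x_t, t).
    # A real model would be a large transformer; this is only a toy.
    return jnp.tanh(x) * jnp.cos(t)


def total_time_derivative(f, x, t, dx_dt):
    # jax.jvp evaluates f and its directional derivative in one forward pass.
    # With tangents (dx/dt, 1), the second output equals
    #   d/dt f(x_t, t) = (df/dx) @ dx/dt + df/dt,
    # without ever materializing the full Jacobian.
    out, df_dt = jax.jvp(f, (x, t), (dx_dt, jnp.ones_like(t)))
    return out, df_dt


# Assumed example inputs: a 3-dimensional state, a scalar time, and a velocity.
x = jnp.array([0.5, -1.0, 2.0])
t = jnp.array(0.3)
dx_dt = jnp.array([1.0, 0.0, -0.5])

out, df_dt = total_time_derivative(toy_model, x, t, dx_dt)
```

Forward-mode differentiation is used here because only one tangent direction is needed, so its cost is comparable to a single extra forward pass. Making that pass efficient inside attention layers is presumably what a dedicated JVP kernel addresses.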
