Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which introduces an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, so the student cannot recover the teacher's flow map and instead converges to a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization and thereby bridges the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
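The conditional-expectation effect mentioned in the abstract can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not the authors' method: it assumes a scalar setting in which one and the same input is paired with several different regression targets (a non-injective mapping from the student's point of view) and the student is fit with a mean-squared-error loss. The function name `mse_optimal_prediction` and the target values are hypothetical and chosen for illustration.

```python
import numpy as np


def mse_optimal_prediction(targets: np.ndarray) -> float:
    """Grid-search the single prediction that minimizes mean squared error.

    All targets are assumed to be paired with the same input, so the
    student can only output one value for all of them.
    """
    # Candidate predictions spanning the range of the targets.
    grid = np.linspace(targets.min(), targets.max(), 2001)
    # Mean squared error of each candidate against every target.
    losses = ((grid[:, None] - targets[None, :]) ** 2).mean(axis=1)
    return float(grid[np.argmin(losses)])


# Hypothetical toy data: the same input maps to two distinct targets.
targets = np.array([-1.0, 1.0])
best = mse_optimal_prediction(targets)
print("MSE-optimal prediction:", best)
print("Mean of the targets:   ", targets.mean())
```

In this toy setting the best single prediction sits at the average of the targets rather than at either target, which is the scalar analogue of a blurred, conditional-expectation output when the mapping to be distilled is not injective.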
